Introduction -- Sampling and Data -- MtRoyal - Version2016RevA


We encounter statistics in our daily lives more often than we probably
realize and from many different sources, like the news. (credit: David Sim)


Note: 
Chapter objective 
By the end of this chapter, the student should be able to: 


• Recognize and differentiate between key terms.
• Apply various types of sampling methods to data collection.


You are probably asking yourself the question, "When and where will I use
statistics?" If you read any newspaper, watch television, or use the Internet,
you will see statistical information. There are statistics about crime, sports, 
education, politics, and real estate. Typically, when you read a newspaper 
article or watch a television news program, you are given sample 
information. With this information, you may make a decision about the 
correctness of a statement, claim, or "fact." Statistical methods can help you 
make the "best educated guess." 


Since you will undoubtedly be given statistical information at some point in 
your life, you need to know some techniques for analyzing the information 
thoughtfully. Think about buying a house or managing a budget. Think 
about your chosen profession. The fields of economics, business, 
psychology, education, biology, law, computer science, police science, and 
early childhood development require at least one course in Statistics. 


Included in this chapter are the basic ideas and words of probability and
statistics. You will soon understand that statistics and probability work
together. You will also learn how data are gathered and how "good" data
can be distinguished from "bad."


Definitions of Statistics, Probability, and Key Terms -- MRU - C Lemieux (2017)


The science of statistics deals with the collection, analysis, interpretation, 
and presentation of data. 


The process of statistical analysis follows these broad steps. 


1. Defining the problem 

2. Planning the study 

3. Collecting the data for the study 

4. Analysis of the data 

5. Interpretations and conclusions based on the analysis 


For example, we may wonder if there is a gap between how much men and 
women are paid for doing the same job. This would be the problem we want 
to investigate. Before we do the investigation, we would want to spend 
some time defining the problem. This could include defining terms (e.g. 
what do we mean by “paid”? what constitutes the “same job”?). Then we 
would want to state a research question. A research question is the 
overarching question that the study aims to address. In this example, our 
research question might be: “Does the gender wage gap exist?”. 


Once we have the problem clearly defined, we need to figure out how we 
are going to study the problem. This would include determining how we are 
going to collect the data for the study. Since it is unlikely we are going to 
find out the salary and position of every employee in the world (i.e. the 
population), we need to instead collect data from a subset of the whole (i.e. 
a sample). The process of how we will collect the data is called the 
sampling technique. The overall plan of how the study is designed is 
called the sampling design or methodology. 


Once we have the methodology, we want to implement it and collect the 
actual data. 


When we have the data, we will learn how to organize and summarize data.
Organizing and summarizing data is called descriptive statistics. Two ways
to summarize data are by visually summarizing the data (for example, a
histogram) and by numerically summarizing the data (for example, the
average). After we have summarized the data, we will use formal methods
for drawing conclusions from "good" data. The formal methods are called
inferential statistics. Statistical inference uses probability to determine
how confident we can be that our conclusions are correct.


Once we have summarized and analyzed the data, we want to see what kind 
of conclusions we can draw. This would include attempting to answer the 
research question and recognizing the limitations of the conclusions. 


In this course, most of our time will be spent in the last two steps of the
statistical analysis process (i.e. organizing, summarizing and analyzing
data). To understand the process of making inferences from the data, we
must also learn about probability. This will help us understand the 
likelihood of random events occurring. 


Key Idea and Terms 


In statistics, we generally want to study a population. You can think of a
population as a collection of persons, things, or objects under study. The
person, thing or object under study (i.e. the object of study) is called the
observational unit. What we are measuring or observing about the
observational unit is called the variable. We often use the letters X or Y
to represent a variable. The specific values that a variable takes are
called data.


Example: 

Suppose our research question is “Do current NHL forwards who make
over $3 million a year score, on average, more than 20 points a season?”
The population would be all of the NHL forwards who make over $3
million a year and who are currently playing in the NHL. The
observational unit is a single member of the population, which would be
any forward who makes over $3 million a year. The variable is what we are
studying about the observational unit, which is the number of points a
forward in the population gets in a season. A data value would be the
actual number of points.


In the above example, it would be reasonable to look at the population 
when doing the statistical analysis as the population is very well defined, 
there are many websites that have this information readily available, and the 
population size is relatively small. But this is not always the case. For 
example, suppose you want to study the average profits of oil and gas
companies in the world. It might be very hard to get a list of all of the oil
and gas companies in the world and to get access to their financial reports.
When the population is not easily accessible, we instead look at a sample. 
The idea of sampling (the process of collecting the sample) is to select a 
portion (or subset) of the larger population and study that portion (the 
sample) to gain information about the population. 


Because it takes a lot of time and money to examine an entire population, 
sampling is a very practical technique. If you wished to compute the overall 
grade point average at your school, it would make sense to select a sample 
of students who attend the school. The data collected from the sample 
would be the students’ grade point averages. In federal elections, opinion
poll samples of 1,000 to 2,000 people are taken. The opinion poll is supposed
to represent the views of the people in the entire country. Manufacturers of
canned carbonated drinks take samples to determine if a 16-ounce can
contains 16 ounces of carbonated drink.


It is important to note that though we might not know the population, when 
we decide to sample from it, it is fairly static. Going back to the example of 
the NHL forwards, if we were to gather the data for the population right 
now, that would be our fixed population. But if you took a sample from that
population and your friend took a sample from that population, it is not
surprising that you and your friend would get different samples. That is,
there is one population, but there are many, many different samples that can
be drawn from the population. How the samples vary from each other is called
sampling variability. The idea of sampling variability is a key concept in
statistics and we will come back to it over and over again.
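
To make sampling variability concrete, the following Python sketch draws two
random samples from one fixed population and compares their means. The
population values are invented for illustration (they are not real NHL
statistics), and the sample size of 10 and the seeds are arbitrary assumptions.

```python
import random
from statistics import mean

# Hypothetical fixed population: season point totals for 60 players.
# These numbers are made up for illustration only.
population_rng = random.Random(2017)
population = [population_rng.randint(5, 90) for _ in range(60)]

def draw_sample(values, size, seed):
    """Return a simple random sample (without replacement) of the given size."""
    return random.Random(seed).sample(values, size)

# You and your friend each draw a sample from the same population.
your_sample = draw_sample(population, 10, seed=1)
friend_sample = draw_sample(population, 10, seed=2)

print("Population mean:      ", round(mean(population), 1))
print("Your sample mean:     ", round(mean(your_sample), 1))
print("Friend's sample mean: ", round(mean(friend_sample), 1))
```

The population list never changes between the two draws; only the members
selected differ, which is why the two sample means will generally not agree
with each other or with the population mean.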


Note: Data is plural. Datum is singular.


As mentioned above, a variable, or random variable, notated by capital
letters such as X and Y, is a characteristic of interest for each person or
thing in a population. Data are the actual values of the variable. Data and
variables fall into two general types: either they measure something or
they do not. When a variable measures or counts something, it is called a
quantitative variable and the data are called quantitative data. When a
variable does not measure or count something, it is called a categorical
variable and the data are called categorical data. For a variable to be
considered quantitative, the distance between consecutive values has to be
fixed. In general, quantitative variables measure something and take on
values with equal units, such as weight in pounds or number of people in a
line. Categorical variables place the person or thing into a category, such
as colour of car or opinion on a topic.


Example: 


• In the NHL forwards example, the variable is quantitative as we are
investigating the number of points a player has.

• In the gender gap example, there were three variables: the salary,
the gender, and the position. The salary is a quantitative variable as we
are investigating the amount people make. Gender is a categorical
variable as we are categorizing someone’s gender. Position is also
categorical as we are categorizing their type of employment.

• Sometimes, though, determining the type of a variable (i.e. quantitative
or categorical) is not cut and dried. In particular, Likert scales
or rating scales are tricky to place. A Likert scale is any scale where
you are asked to state your opinion on a scale. For example, you may
be asked whether you strongly agree, agree, are neutral, disagree or
strongly disagree with a statement. Sometimes there is a number
associated with the rating. For example, write 5 if you strongly agree
and 1 if you strongly disagree. Technically, Likert scale data are
categorical data as we are categorizing people’s opinions and the
number is just a short form for the category.


Note: When you are asked to categorize the data or variable, first
determine what the observational unit is. Then determine the variable being
studied. Then think about what the data will look like. If the data are
numbers, then they are usually quantitative data (be wary of Likert scales).
If the data are words or categories, then they are categorical data.


Exercise: 


Problem: 


For the following research questions, state the observational unit, the 
variable being studied, and the type of variable. 


a. What is the average monthly temperature in Edmonton? 

b. What is the highest belt colour that most students of karate earn in 
Canada? 

c. What is the average weight of greyhound dogs? 

d. What is the average gross profit of movies made in 2016? 

e. What is the average user rating of Jessica Jones season 1 on 
IMDB? 

f. What is the most common colour of car in Nova Scotia? 


Solution: 


a. Observational unit: Edmonton. Variable: Monthly Temperature. 
Type: Quantitative. 

b. Observational unit: Student of karate in Canada. Variable: Highest 
colour of belt earned. Type: Categorical. 


c. Observational unit: Greyhounds. Variable: Weight. Type: 
Quantitative. 

d. Observational unit: Movies made in 2016. Variable: Gross profit. 
Type: Quantitative. 

e. Observational unit: Jessica Jones. Variable: User ratings. Type: 
Categorical. 

f. Observational unit: Cars in Nova Scotia. Variable: Colour. Type: 
Categorical. 


Two words that come up often in statistics are mean and proportion. These
are two examples of numerical descriptive statistics. If you were to take
three exams in your math classes and obtain scores of 86, 75, and 92, you 
would calculate your mean score by adding the three exam scores and 
dividing by three (your mean score would be 84.3 to one decimal place). If, 
in your math class, there are 40 students and 22 are men then the proportion 
of men in the course is 55% and the proportion of women is 45%. 
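
The arithmetic in this paragraph can be checked with a few lines of Python.
This is only a sketch of the calculation; the variable names are chosen here
for illustration, and the number of women is assumed to be the remaining 18
students, as the 45% in the text implies.

```python
from statistics import mean

# Exam scores from the text.
scores = [86, 75, 92]
mean_score = mean(scores)            # (86 + 75 + 92) / 3
print(round(mean_score, 1))          # 84.3

# Class composition from the text: 40 students, 22 of whom are men.
total_students = 40
men = 22
proportion_men = men / total_students
proportion_women = (total_students - men) / total_students
print(f"{proportion_men:.0%} men, {proportion_women:.0%} women")  # 55% men, 45% women
```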


From the sample data, we can calculate a statistic. A statistic is a numerical 
summary that represents a property of the sample. For example, if we 
consider one math class to be a sample of the population of all math classes, 
then the mean number of points earned by students in that one math class at 
the end of the term is an example of a statistic. The statistic is an estimate of 
a population parameter, in this case the mean. A parameter is a numerical 
summary that represents a property of the population. Since we considered 
all math classes to be the population, then the mean number of points 
earned per student over all the math classes is an example of a parameter 
(i.e. the population mean). If we took a sample of students from the math 
class and found the mean points earned per student in the sample, then we 
would have found a statistic (i.e. the sample mean). 
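
The distinction between a parameter and a statistic can be sketched in Python
as follows. The point totals are randomly generated stand-ins, and the choice
of 5 classes with 30 students each is an assumption made only so the example
runs; none of these values come from the text.

```python
import random
from statistics import mean

rng = random.Random(42)

# Hypothetical population: points earned by every student in all math classes
# (5 classes of 30 students each; the values are invented).
all_classes = [[rng.randint(40, 100) for _ in range(30)] for _ in range(5)]
population = [points for one_class in all_classes for points in one_class]

# Parameter: a numerical summary of the whole population.
population_mean = mean(population)

# Statistic: the same summary computed from a sample (here, one class).
sample = all_classes[0]
sample_mean = mean(sample)

print(f"Parameter (population mean): {population_mean:.2f}")
print(f"Statistic (sample mean):     {sample_mean:.2f}")
```

In practice the population list is usually unavailable, so only the statistic
can be computed and it serves as the estimate of the parameter.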


Example: 

In the NHL example, a sample of the population may be 31 forwards who
make over $3 million per year. The sample was chosen by randomly
choosing one forward who makes over $3 million from each team (if you
are reading this after Sept. 2021, this would be changed to 32). The process
of choosing the sample is called sampling. We would then collect the data
for the sample, which would be the number of points each player in our
sample gets in one season. The statistic would be the mean of the total
number of points for the sample. The parameter at this point would be
unknown, but we could estimate it with our statistic. To find the parameter,
we would have to find the mean of the total number of points for the
population.


One of the main concerns in the field of statistics is how accurately a
statistic estimates a parameter. The accuracy really depends on how well the
sample represents the population. The sample must contain the 
characteristics of the population in order to be a representative sample. We 
are interested in both the sample statistic and the population parameter in 
inferential statistics. In a later chapter, we will use the sample statistic to 
test the validity of the established population parameter. 


Example: 
Exercise: 


Problem: 
Determine what the key terms refer to in the following study. We want
to know what proportion of first-year students get to ABC college
using public transit. We randomly survey 100 first-year students at
ABC college.


Solution: 


The population is all first year students attending ABC college this 
term. 


The sample is the 100 first-year students who were randomly surveyed.
Note that the sample depends on how we choose the students. Had we
instead surveyed all students enrolled in one section of a beginning
statistics course at ABC College, that sample would be neither random
nor representative of the entire population.


The variable would be whether a first-year student uses public 
transportation to get to ABC college or not. 


The data are the actual values of the variable. As students would
either use public transportation or not, the data would be "yes" or "no",
or "public transportation" or "not public transportation" (depending on
how you chose to represent your data).


The statistic is the proportion of students in your SAMPLE who use 
public transportation to get to ABC college. (Note: The mean would 
not be an appropriate summary here as you cannot find the mean of 
categorical data). 


The parameter is the proportion of ALL first-year students who use 
public transportation to get to ABC college. 
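
As a rough illustration of how the statistic in this solution would be
computed, the sketch below uses invented survey responses (37 "yes" and 63
"no"); the study in the problem does not report any actual counts.

```python
# Hypothetical survey responses from 100 first-year students:
# "yes" means the student uses public transit. The counts are invented.
responses = ["yes"] * 37 + ["no"] * 63

# The statistic: the proportion of "yes" answers in the sample.
sample_proportion = responses.count("yes") / len(responses)
print(sample_proportion)  # 0.37
```

The data here are categorical, so the proportion is a suitable summary where
a mean would not be.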


Note: 
Try It 
Exercise: 


Problem: 


Determine what the key terms refer to in the following study. We want 
to know the average (mean) amount of money spent on school 
uniforms each year by families with children at Knoll Academy. We 
randomly survey 100 families with children in the school. Three of 
the families spent $65, $75, and $95, respectively. 


Solution: 
Try It Solutions 


The population is all families with children attending Knoll 
Academy. 


The sample is a random selection of 100 families with children 
attending Knoll Academy. 


The parameter is the average (mean) amount of money spent on 
school uniforms by families with children at Knoll Academy. 


The statistic is the average (mean) amount of money spent on school 
uniforms by families in the sample. 


The variable is the amount of money spent by one family. Let X = the 
amount of money spent on school uniforms by one family with 
children attending Knoll Academy. 


The data are the dollar amounts spent by the families. Examples of 
the data are $65, $75, and $95. 


Example: 
Exercise: 


Problem: 
Determine what the key terms refer to in the following study. 


A study was conducted at a local college to analyze the average
cumulative GPAs of students who graduated last year. Fill in the letter
of the phrase that best describes each of the items below.


1. Population 2. Statistic 3. Parameter 4.
Sample 5. Variable 6. Data


• a) all students who attended the college last year

• b) the cumulative GPA of one student who graduated from the
college last year

• c) 3.65, 2.80, 1.50, 3.90

• d) a group of students who graduated from the college last year,
randomly selected

• e) the average cumulative GPA of students who graduated from
the college last year

• f) all students who graduated from the college last year

• g) the average cumulative GPA of students in the study who
graduated from the college last year


Solution: 


1. f  2. g  3. e  4. d  5. b  6. c


Example: 
Exercise: 


Problem: 
Determine what the key terms refer to in the following study. 


As part of a study designed to test the safety of automobiles, the 
National Transportation Safety Board collected and reviewed data 
about the effects of an automobile crash on test dummies. Here is the 
criterion they used: 


Speed at which Cars Crashed    Location of “drive” (i.e. dummies)
35 miles/hour                  Front Seat


Cars with dummies in the front seats were crashed into a wall at a
speed of 35 miles per hour. We want to know the proportion of
dummies in the driver’s seat that would have had head injuries, if they
had been actual drivers. We start with a simple random sample of 75
cars.


Solution: 
The population is all cars containing dummies in the front seat. 
The sample is the 75 cars, selected by a simple random sample. 


The parameter is the proportion of driver dummies (if they had been 
real people) who would have suffered head injuries in the population. 


The statistic is the proportion of driver dummies (if they had been real
people) who would have suffered head injuries in the sample.


The variable X = the number of driver dummies (if they had been real 
people) who would have suffered head injuries. 


The data are either: yes, had head injury, or no, did not. 


Example: 
Exercise: 


Problem: 
Determine what the key terms refer to in the following study. 


An insurance company would like to determine the proportion of all 
medical doctors who have been involved in one or more malpractice 
lawsuits. The company selects 500 doctors at random from a 
professional directory and determines the number in the sample who 
have been involved in a malpractice lawsuit. 


Solution: 


The population is all medical doctors listed in the professional 
directory. 


The parameter is the proportion of medical doctors who have been 
involved in one or more malpractice suits in the population. 


The sample is the 500 doctors selected at random from the 
professional directory. 


The statistic is the proportion of medical doctors who have been 
involved in one or more malpractice suits in the sample. 


The variable X = the number of medical doctors who have been 
involved in one or more malpractice suits. 


The data are either: yes, was involved in one or more malpractice 
lawsuits, or no, was not. 


Chapter Review


The mathematical theory of statistics is easier to learn when you know the 
language. This module presents important terms that will be used 
throughout the text. 


HOMEWORK 


For each of the following eight exercises, identify: a. the population, b. the 
sample, c. the parameter, d. the statistic, e. the variable, and f. the data. 
Give examples where appropriate. 

Exercise: 


Problem: 


A fitness center is interested in the mean amount of time a client 
exercises in the center each week. 


Exercise: 


Problem: 


Ski resorts are interested in the mean age at which children take their
first ski and snowboard lessons. They need this information to plan their
ski classes optimally.


Solution: 


a. all children who take ski or snowboard lessons 

b. a group of these children 

c. the population mean age at which children take their first ski or
snowboard lesson

d. the sample mean age at which children take their first ski or
snowboard lesson

e. X = the age of one child who takes his or her first ski or
snowboard lesson

f. values for X, such as 3, 7, and so on 


Exercise: 
Problem: 
A cardiologist is interested in the mean recovery period of her patients 
who have had heart attacks. 
Exercise: 
Problem: 
Insurance companies are interested in the mean health costs each year
of their clients, so that they can determine the costs of health
insurance.


Solution: 


a. the clients of the insurance companies 

b. a group of the clients 

c. the mean health costs of the clients 

d. the mean health costs of the sample 

e. X = the health costs of one client

f. values for X, such as 34, 9, 82, and so on 


Exercise: 
Problem: 
A politician is interested in the proportion of voters in his district who 
think he is doing a good job. 
Exercise: 
Problem: 


A marriage counselor is interested in the proportion of clients she 
counsels who stay married. 


Solution: 


a. all the clients of this counselor 

b. a group of clients of this marriage counselor 

c. the proportion of all her clients who stay married 

d. the proportion of the sample of the counselor’s clients who stay 
married 

e. X = the number of couples who stay married

f. yes, no 


Exercise: 
Problem: 


Political pollsters may be interested in the proportion of people who 
will vote for a particular cause. 


Exercise: 


Problem: 


A marketing company is interested in the proportion of people who 
will buy a particular product. 


Solution: 


a. all people (maybe in a certain geographic area, such as the United 
States) 

b. a group of the people 

c. the proportion of all people who will buy the product 

d. the proportion of the sample who will buy the product 

e. X = the number of people who will buy it

f. buy, not buy 


Use the following information to answer the next three exercises: A Lake 
Tahoe Community College instructor is interested in the mean number of 
days Lake Tahoe Community College math students are absent from class 
during a quarter. 

Exercise: 


Problem: What is the population she is interested in? 
a. all Lake Tahoe Community College students 
b. all Lake Tahoe Community College English students 


c. all Lake Tahoe Community College students in her classes 
d. all Lake Tahoe Community College math students 


Exercise: 


Problem: Consider the following: 


X = number of days a Lake Tahoe Community College math student is 
absent 


In this case, X is an example of a: 


a. variable. 

b. population. 
c. statistic.

d. data. 


Solution: 


a 
Exercise: 
Problem: 


The instructor’s sample produces a mean number of days absent of 3.5 
days. This value is an example of a: 


a. parameter. 
b. data. 

c. statistic.
d. variable. 


Glossary 


Average 
also called mean or arithmetic mean; a number that describes the 
central tendency of the data 


Categorical Variable 
variables that take on values that are names or labels 


Data
a set of observations (a set of possible outcomes); most data used in
statistical research can be put into two groups: categorical (an
attribute whose value is a label) or quantitative (an attribute whose
value is indicated by a number). Categorical data can be separated into
two subgroups: nominal and ordinal. Data is nominal if it cannot be
meaningfully ordered. Data is ordinal if the data can be meaningfully
ordered. Quantitative data can be separated into two subgroups:
discrete and continuous. Data is discrete if it is the result of counting
(such as the number of students of a given ethnic group in a class or
the number of books on a shelf). Data is continuous if it is the result of
measuring (such as distance traveled or weight of luggage)


Numerical Variable 
variables that take on values that are indicated by numbers 


Parameter 
a number that is used to represent a population characteristic and that 
generally cannot be determined easily 


Population 
all individuals, objects, or measurements whose properties are being 
studied 


Probability 
a number between zero and one, inclusive, that gives the likelihood 
that a specific event will occur 


Proportion 
the number of successes divided by the total number in the sample 


Representative Sample 
a subset of the population that has the same characteristics as the 
population 


Sample 
a subset of the population studied 


Statistic
a numerical characteristic of the sample; a statistic estimates the
corresponding population parameter.


Variable 
a characteristic of interest for each person or object in a population 


Data, Sampling, and Variation -- MRU - C Lemieux (2017) 


Data may come from a population or from a sample. Small letters like x or 
y generally are used to represent data values. Most data can be put into the 
following categories: 


• Categorical
• Quantitative


Categorical data (also called qualitative data) are the result of categorizing 
or describing attributes of a population. Hair colour, blood type, ethnic 
group, the car a person drives, and the street a person lives on are examples 
of categorical data. Categorical data are generally described by words or 
letters. For instance, hair colour might be black, dark brown, light brown, 
blonde, grey, or red. Blood type might be AB+, O-, or B+. Researchers 
often prefer to use quantitative data over categorical data because it lends
itself more easily to mathematical analysis. For example, it does not make
sense to find an average hair colour or blood type.


There are two types of categorical data: nominal and ordinal. Nominal data
is categorical data that cannot be ordered in a meaningful way. For example,
the colour of a car is categorical, but the order of the colours is not
meaningful. Ordinal data is categorical data that can be ordered in a
meaningful way. For example, the level of satisfaction someone has with
their experience at a restaurant can be ordered from not at all satisfied
to completely satisfied.


Quantitative data are always numbers. Quantitative data are the result of 
counting or measuring attributes of a population. Amount of money, pulse 
rate, weight, number of people living in your town, and number of students 
who take statistics are examples of quantitative data. Quantitative data may 
be either discrete or continuous. 


All data that are the result of counting are called quantitative discrete 
data. These data take on only certain numerical values. If you count the 
number of phone calls you receive for each day of the week, you might get 
values such as zero, one, two, or three. 


All data that are the result of measuring are quantitative continuous data,
assuming that we can measure accurately. Time, distance, area, and any other
quantity that can be subdivided and then subdivided again and again are
continuous variables. If you and your friends carry backpacks with
books in them to school, the numbers of books in the backpacks are discrete
data and the weights of the backpacks are continuous data.


Example: 

Data Sample of Quantitative Discrete Data 

The data are the number of books students carry in their backpacks. You 
sample five students. Two students carry three books, one student carries 
four books, one student carries two books, and one student carries one 
book. The numbers of books (three, four, two, and one) are the quantitative 
discrete data. 


Note: 
Try It 
Exercise: 


Problem: 

The data are the number of machines in a gym. You sample five 
gyms. One gym has 12 machines, one gym has 15 machines, one gym 
has ten machines, one gym has 22 machines, and the other gym has 
20 machines. What type of data is this? 


Solution: 
Try It Solutions 


quantitative discrete data 


Example: 


Data Sample of Quantitative Continuous Data 

The data are the weights of backpacks with books in them. You sample the 
same five students. The weights (in pounds) of their backpacks are 6.2, 7, 
6.8, 9.1, 4.3. Notice that backpacks carrying three books can have different 
weights. Weights are quantitative continuous data because weights are 
measured. 


Note: 
Try It 
Exercise: 


Problem: 


The data are the areas of lawns in square feet. You sample five 
houses. The areas of the lawns are 144 sq. feet, 160 sq. feet, 190 sq. 
feet, 180 sq. feet, and 210 sq. feet. What type of data is this? 


Solution: 
Try It Solutions 


quantitative continuous data 


Example: 

You go to the supermarket and purchase three cans of soup (19 ounces
tomato bisque, 14.1 ounces lentil, and 19 ounces Italian wedding), two
packages of nuts (walnuts and peanuts), four different kinds of vegetables
(broccoli, cauliflower, spinach, and carrots), and two desserts (16 ounces
Cherry Garcia ice cream and two pounds (32 ounces) chocolate chip
cookies).

Exercise: 


Problem: 


Name data sets that are quantitative discrete, quantitative continuous, 
categorical ordinal, and categorical nominal. 


Solution: 
One Possible Solution: 


• The three cans of soup, two packages of nuts, four kinds of
vegetables and two desserts are quantitative discrete data because
you count them.

• The weights of the soups (19 ounces, 14.1 ounces, 19 ounces) are
quantitative continuous data because you measure weights as
precisely as possible.

• Types of soups, nuts, vegetables and desserts are categorical
nominal data because they are categories and fundamentally
words. Further, there is no meaningful order.

• Descriptions of amount of rain (e.g. light, heavy) are categorical
ordinal data as they are categories but have a meaningful order.


Try to identify additional data sets in this example. 


Example: 

The data are the colors of backpacks. Again, you sample the same five 
students. One student has a red backpack, two students have black 
backpacks, one student has a green backpack, and one student has a gray 
backpack. The colors red, black, black, green, and gray are categorical 
nominal data. 


Note: 
Try It 
Exercise: 


Problem: 


The data are the colors of houses. You sample five houses. The colors 
of the houses are white, yellow, white, red, and white. What type of 
data is this? 


Solution: 
Try It Solutions 


categorical nominal data 


Note:

You may collect data as numbers and report it categorically. For example, 
the quiz scores for each student are recorded throughout the term. At the 
end of the term, the quiz scores are reported as A, B, C, D, or F. The data is 
ordinal as there is a meaningful order. 
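
The conversion described in this note can be sketched in Python. The cutoffs
(90, 80, 70, 60) and the quiz scores below are assumptions made for
illustration; the text does not specify a grading scale.

```python
# Hypothetical grade cutoffs; the text does not specify any.
CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]

def letter_grade(score):
    """Map a numeric score (quantitative) to a letter grade (ordinal)."""
    for minimum, letter in CUTOFFS:
        if score >= minimum:
            return letter
    return "F"

quiz_scores = [93, 78, 85, 59, 64]          # invented scores
letters = [letter_grade(s) for s in quiz_scores]
print(letters)                               # ['A', 'C', 'B', 'F', 'D']

# Ordinal data can be sorted by rank, here from best to worst.
RANK = {"A": 0, "B": 1, "C": 2, "D": 3, "F": 4}
print(sorted(letters, key=RANK.get))         # ['A', 'B', 'C', 'D', 'F']
```

The letters can be ranked, which is what makes them ordinal, but the gap
between an A and a B is not a fixed numeric distance, so they are not
quantitative.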


Note: 
Try It 
Exercise: 


Problem: 


Determine the correct data type (quantitative or categorical) for the 
number of cars in a parking lot. Indicate whether quantitative data are 
continuous or discrete. 


Solution: 
Try It Solutions 


quantitative discrete 


Example: 
Exercise: 


Problem: 


A statistics professor collects information about the classification of 
her students as freshmen, sophomores, juniors, or seniors. The data 
she collects are summarized in the pie chart. What type of data
does this graph show?

Figure: pie chart titled "Classification of Statistics Students" with
categories Freshman, Sophomore, Junior, and Senior.


Solution: 


This pie chart shows the students in each year, which is categorical 
nominal data. 


Note: 
Try It 
Exercise: 


Problem: 


The registrar at State University keeps records of the number of credit 
hours students complete each semester. The data he collects are 
summarized in the histogram. The class boundaries are 10 to less than 
13, 13 to less than 16, 16 to less than 19, 19 to less than 22, and 22 to 
less than 25. 


[Figure: histogram titled "Number of Credit Hours Completed per Student". 
Horizontal axis: credit hours completed (10 to 25). Vertical axis: number 
of students.] 


What type of data does this graph show? 


Solution: 
Try It Solutions 


A histogram is used to display quantitative data: the numbers of credit 
hours completed. Because students can complete only a whole 
number of hours (no fractions of hours allowed), this data is 
quantitative discrete. 


Sampling 


Gathering information about an entire population often costs too much or is 
virtually impossible. Instead, we use a sample of the population. The goal 
would be to use information from the sample to estimate information about 
the population. 


To collect the sample, a sampling technique is used. Not all sampling 
techniques are created equal, though. A good sampling technique meets the 
following criteria: 


• The sample is collected randomly 
• The sample is representative of the population 
• The size of the sample is large enough 


If a sampling technique does not meet these criteria, then it is not 
appropriate to make inferences from the data. For example, it would not be 
appropriate to estimate the population mean from the sample mean. 


A random sample reduces bias, promotes representativeness, and is a key 
component to sampling. To do any scientific statistical analysis on 
sample data, the sample has to be randomly selected. In a random 
sample, members of the population are selected in such a way that each has 
an equal chance of being selected. To ensure that a sample is collected 
randomly, some element of randomness needs to be included in the 
sampling technique. This can involve using dice to choose the time to start 
collecting data or using a random number generator to pick names from a 
list of names. 


Note: Humans in general are not very random. Therefore, the randomness 
added to the sampling technique cannot be someone “randomly” choosing 
something. The randomness has to come from a random event (like rolling 
dice, flipping a coin, using a random number generator). 


A sample is representative if it shares similar characteristics to the 
population. For example, suppose that the students at a university are 
distributed as follows by faculty: 


• Business: 20% 
• Arts: 25% 
• Science and Engineering: 30% 
• Nursing: 15% 
• Education: 10% 


Then a sample would be representative of this population if the distribution 
of the students’ faculty in the sample was similar to the population. It 
doesn’t have to be exactly the same, but it should be close. A random 
sample will generate a fairly representative sample, but it doesn’t guarantee 
it. 


Note: What makes a sample representative depends on what is being 
studied. For example, if we are looking at the average age of students at a 
university, making sure we get students from each faculty would be 
important, but making sure we get students from various political 
affiliations might not be. 


Determining if a sample is large enough is a bit arbitrary and depends on 
the situation. In general, the larger the sample size the better, but issues 
such as time and money need to be taken into account. You don’t want to 
interview 5000 people, when 50 people would do. In Chapter 7, we will 
look at a formula that determines how many members of a population need 
to be in a sample depending on the level of error we are comfortable with. 
Until then, as a general rule, if the data is quantitative, a sample of at least 
30 is usually good enough, while if the data is categorical, a sample of at 
least 100 is usually good enough. 


In general, even if a sample is collected extremely well, it will not be 
perfectly representative of the population. The discrepancy between the 
sample and the population is called chance error due to sampling. When 
dealing with samples, there will always be error. Statistics helps us to 
understand and even measure this error. As a rule, the larger the random 
sample, the smaller the sampling error. 


Note: 

Generally, a sample that is collected randomly will likely be representative. 
But this is not guaranteed. For example, it is possible to collect a random 
sample of university students that happens to only contain students from 
one faculty. It is unlikely but possible. 

A large sample size does not guarantee a representative sample. Nor does a 
small sample size guarantee a non-representative sample. To illustrate, a 
sample of ten university students could be chosen so that the proportion of 
students from each gender in the sample is similar to the population, and 
the proportion of students from each faculty in the sample is similar to the 
population. Thus, the sample of size 10 would be representative. The point 
of a larger sample size is that the larger the sample, the more likely it is to 
be representative. 

Of the three characteristics of a good sample, the most important one for 
statistical analysis is that the sample is collected randomly. 
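To get a feel for "unlikely but possible", the sketch below estimates the chance that every student in a random sample comes from the same faculty, using the faculty percentages given earlier. It assumes the university is large enough that each pick is effectively independent of the others; that assumption, and the sample sizes tried, are ours and not part of the text.

```python
# Faculty shares from the example earlier in this section.
faculty_share = {
    "Business": 0.20,
    "Arts": 0.25,
    "Science and Engineering": 0.30,
    "Nursing": 0.15,
    "Education": 0.10,
}


def prob_single_faculty(n):
    """Chance that all n randomly chosen students share one faculty.

    Assumes independent picks (a very large population), so the chance
    that all n come from a given faculty is share ** n.
    """
    return sum(share ** n for share in faculty_share.values())


for n in (2, 5, 10, 30):
    print(f"n = {n:2d}: {prob_single_faculty(n):.2e}")
```

The probability is never zero, but it shrinks very quickly as the sample grows, which is why larger random samples are more likely to be representative.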


Areas of concern for sampling bias 

When people publish their research, they include a description of their 
sampling technique. This is called the methodology. When evaluating a 
sampling technique, check to see if the sample was collected randomly, if it 
is representative of the population, and if the sample is large enough. Here 
are some examples of areas of concern when looking at methodologies: 


1. Undercoverage occurs when a particular subset of the population is 
excluded from the process of selecting the sample. For example, if no 
one from the faculty of nursing is included in the sample, then we 
would say that the faculty of nursing is undercovered. As another 
example, undercoverage has been a specific concern in drug research 
over the years. In particular, women have been traditionally excluded 
from drug studies because of their menstrual cycles, but this results in 
the research only indicating how well the drug works for men. 


2. Nonresponse bias occurs when a member of the population that is 
selected as part of the sample cannot be contacted or refuses to 
participate. Have you ever refused to be part of a telephone study? If 
so, you are contributing to nonresponse bias. 


◦ Similar to nonresponse bias is voluntary response bias. Here a 
large segment of the population is contacted and people choose to 
participate or not. Examples of this are mail-out surveys or online 
polls. In these situations, the people who respond are usually very 
invested in the issue, which is why they take the time to answer. 
This results in non-representative samples. The same applies to 
surveys posted on a website: only people familiar with the website 
are likely to participate or "volunteer" to be part of the survey. 

◦ Response rate is a measure of how many people responded out of 
the total contacted. If the response rate is low, then this suggests a 
very narrow segment of the population answered. This would 
raise concerns about representativeness. 


3. Asking potentially awkward questions might result in untruthful 
responses. This is called response bias. For example, if you are asked 
if you have ever had a sexually transmitted infection, you may not 
want to divulge that. One way to minimize response bias is to allow 
participants in a study to answer the questions anonymously. 

4. Improper wording of questions being asked might result in skewed 
answers. Here is an example of a question that skews the results: 


◦ Do you think it should be easier for seniors to make ends meet? 


▪ Yes - they’ve worked hard and helped build our country 
▪ No - seniors don’t need any help or recognition 


The wording of this question makes it hard to say "no", thus 
skewing the results towards "yes". 


A famous example of a survey that had a very poor methodology was the 
incorrect prediction by the Literary Digest that Landon would beat Roosevelt 
in the 1936 US election. Check out the following website for more 
information: 
https://www.math.upenn.edu/~deturck/m170/wk4/lecture/case2.html 


Sampling techniques 


Most statisticians and researchers use various methods of random sampling 
in an attempt to achieve a good sample. This section will describe a few of 
the most common techniques: simple random sampling, (proportional) 
stratified random sampling, cluster sampling, systematic random sampling, 
and convenience sampling. 


Simple random sampling 

The easiest method to describe is called a simple random sample. In this 
technique, a random sample is taken from the members of the population. 
This can be done by putting the names (or identifier) of all members of the 
population into a hat and pulling out those names (or identifiers) to choose 
the sample. Or the population can be numbered and a random number 
generator can choose the sample. Here, each member of the population has 
an equal chance of being chosen. If the goal of the technique is to get a very 
random sample, this is the best method to use. But it requires having a list 
of the whole population, which is not always realistic. 


For example, suppose you want to take a random sample of university 
students. Each student is already numbered by their student ID. You could 
randomly select the members of your sample by using a random number 
generator to randomly select student ID numbers. 
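As a minimal sketch of this idea, the Python code below draws a simple random sample of student ID numbers. The ID range, the population size of 14,000, and the sample size of 100 are illustrative assumptions; `random.sample` is the standard-library function that picks items without repeats, so every ID has the same chance of being chosen.

```python
import random

# Hypothetical student ID numbers (assumption): 14,000 consecutive IDs.
student_ids = list(range(100001, 100001 + 14000))

rng = random.Random(2016)                   # fixed seed so reruns match
sample = rng.sample(student_ids, k=100)     # 100 distinct IDs

print(len(sample), "students selected, for example:", sorted(sample)[:5])
```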


Stratified sampling and proportionate stratified sampling 

If there are concerns that a random sample might not fully represent a 
population (e.g. one portion of the population is small compared to 
another), the best sampling technique to use is stratified random 
sampling. In this case, divide the population into groups called strata and 
then take a random sample from each stratum. The strata are chosen to be 
portions of the population that need to be represented in the sample. The 
strata need to be mutually exclusive. That means 
that each member of the population can only belong to one stratum. 


For example, you could stratify (group) your university population by 
faculty and then choose a simple random sample from each stratum (each 
faculty) to get a stratified random sample. As a student should only belong 
to one faculty, the groups are mutually exclusive. Further, this method 
ensures our sample is representative of the population by choosing students 
from each faculty at the university. Using the students per faculty example 
above, if the sample size is 100, to get a stratified sample, you would 
randomly select 20 students from each faculty (as there are 5 faculties and 
100 students, choose an equal number from each faculty). 


If the number sampled from each stratum is proportionate to the size of that 
stratum, this is called proportionate stratified random sampling. If you wanted a 
proportionate stratified random sample for students by faculty, you would 
randomly select 20 students from business, 25 students from arts, 30 from 
science and engineering, 15 from nursing, and 10 from education (i.e. 
proportional to the number of students in each faculty). This technique is 
best used when there are large differences in the proportion of each group. 
For example, if the faculty of business had 50% of the students and the 
faculty of nursing only had 1% of the students, it would not be good to have 
an equal number of students from each faculty. 
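The allocation above can be checked with a short Python sketch. The faculty percentages and the sample size of 100 come from the example; the rosters (14,000 students split by the same percentages) and the student labels are hypothetical stand-ins for a real class list.

```python
import random

# Faculty shares from the example, as whole-number percentages.
faculty_percent = {
    "Business": 20,
    "Arts": 25,
    "Science and Engineering": 30,
    "Nursing": 15,
    "Education": 10,
}
sample_size = 100

# Proportionate allocation: each stratum's share of the sample matches
# its share of the population.
allocation = {f: pct * sample_size // 100 for f, pct in faculty_percent.items()}

# Hypothetical rosters (assumption): 14,000 students in total.
rosters = {
    f: [f"{f} student {i}" for i in range(1, pct * 140 + 1)]
    for f, pct in faculty_percent.items()
}

rng = random.Random(3)                      # fixed seed so reruns match
# Simple random sample within each stratum.
stratified_sample = {f: rng.sample(rosters[f], allocation[f]) for f in rosters}

for f, chosen in stratified_sample.items():
    print(f, len(chosen))
```

Replacing `allocation` with 20 for every faculty would give the equal-allocation stratified sample described earlier.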


Note: To randomly choose students from each faculty, a random sampling 
technique needs to be used. This could be simple random sampling or 
systematic random sampling (see below). 


Cluster sampling 

To choose a cluster sample, divide the population into clusters (groups) and 
then randomly select one of the clusters. That cluster is your sample. 
Further, the clusters need to be homogeneous (similar to each other) and 
each cluster needs to be representative of the population. For example, 
suppose the university has a series of foundational classes that every 
student has to take and that students in these classes come from all 
faculties. Then we would randomly select one of these classes to be our 
sample. Again, to randomly select the class, you have to use a random 
sampling technique. Here, you could number all of the classes and then use 
a random number generator to choose one of them. 


If one cluster is too small for the sample, you can choose more than one 
cluster. For example, if you want your sample to be 120 students but each 
of the foundational classes has only 30 students in it, you can 
randomly select 4 classes to get to your desired sample size. 
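A minimal Python sketch of this example follows. The number of foundational classes (40) and the class and student labels are hypothetical, and the sketch assumes each student appears in only one of the classes.

```python
import random

# Hypothetical clusters (assumption): 40 foundational classes of 30 students.
classes = {
    f"class {c}": [f"class {c}, student {s}" for s in range(1, 31)]
    for c in range(1, 41)
}

rng = random.Random(7)                          # fixed seed so reruns match
chosen_classes = rng.sample(sorted(classes), k=4)   # randomly pick 4 clusters

# Everyone in a chosen cluster is in the sample.
sample = [student for c in chosen_classes for student in classes[c]]

print(chosen_classes)
print(len(sample), "students in the sample")
```

Notice that the randomness is applied to the classes, not to the individual students.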


Cluster sampling can be very convenient as the members of the sample are 
in one location. In the above example, the members of the sample are all in 
one class, so you would just go to that class and collect your sample. 
Notice that for stratified sampling, we would have to find each student 
chosen from each faculty. Thus, cluster sampling can save time and money. 
But it does present a real chance of undercoverage. If the foundational 
class chosen is at a time that nursing students are at a practicum, then 
that faculty would be undercovered. This means that cluster sampling can 
result in non-representative samples. This is only a good technique to use 
if the clusters are very similar to each other and each cluster would be 
representative of the population. 


Note: 

Cluster vs. stratified 

Cluster sampling and stratified sampling are often confused. In each case, 
the population is divided into groups. But in stratified sampling, a few 
people from all groups (strata) are chosen, while in cluster sampling, all of 
the people from a group (cluster) are chosen. 

Additionally, how the groups are chosen is different. In stratified 
sampling, the groups are chosen to be heterogeneous (i.e. each group has a 
different quality). As an example, breaking a university into different 
faculties results in groups that are heterogeneous as each group has a 
different quality (i.e. faculty) than the other groups. On the other hand, in 
cluster sampling, the groups are chosen to be homogeneous (i.e. the groups 
have similar qualities). That is, we want each cluster to be similar to the 
other clusters. 


Systematic random sampling 

To choose a systematic random sample, randomly select a starting point 
and take every kth piece of data from a list of the population. For example, 
to choose a random sample of university students, you could use a list of all 
student names that are numbered by their student ID. Suppose there are 
14,000 students at the university. To perform systematic random sampling, 
use a random number generator to pick a student ID number that represents 
the first name in the sample. Then calculate k. To do this, k is found by 
taking the population size (14,000) and dividing by the size of the sample 
(100). In this case, this results in 140. Thus, from your random starting 
point, choose every 140th name thereafter until you have a total of 100 
names. If you reach the end of the list before completing your sample you 
simply go back to the beginning and keep going until the sample is 
complete. 
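The procedure above can be sketched in Python as follows. The population size (14,000) and sample size (100) come from the example; representing the list by positions 0 to 13,999 and the seed value are our assumptions. The `%` operator handles going back to the beginning of the list when the end is reached.

```python
import random

population_size = 14_000
sample_size = 100
k = population_size // sample_size          # 14,000 / 100 = 140

rng = random.Random(42)                     # fixed seed so reruns match
start = rng.randrange(population_size)      # random starting position (0-based)

# Take every kth position from the start, wrapping around at the end.
positions = [(start + i * k) % population_size for i in range(sample_size)]

print("k =", k)
print("first five positions:", positions[:5])
```

In practice, each position would be looked up in the list of students to find the name or student ID at that position.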


Be careful: k needs to be large enough to ensure that you cycle through all 
the names. Otherwise the sample is not random nor is it representative. If k 
had been 10, then once the random starting point was chosen only 1000 
names had a chance of being chosen which means that not everyone has an 
equal chance of being chosen. Further, depending on how the list is sorted, 
it may not be representative. For example, if our list of students is by 
faculty, then only certain faculties could make it in our sample. In our 
example, any k larger than 140 would be appropriate. Systematic sampling 
is frequently chosen because it is a simple method that can be easily 
implemented. But like simple random sampling, a list of the population is 
needed to do it properly. 


There is a variation of systematic random sampling that can be used when 
the list of the population does not exist or is not available to the people 
doing the pull. For example, suppose you are doing a survey about people’s 
satisfaction with a certain mall’s hours. You won’t have a list of all of the 
people who go to the mall. Instead, you may stand at an entrance to the mall 
and ask every fifth person who enters the mall to complete your survey. To 
ensure the sampling technique is representative, you’ll want to do the 
survey multiple times at multiple locations. To ensure that the sampling 
technique is random, you’ll want to randomly choose your starting times 
and locations. Having said that, this method would never be completely 
representative or random, but it may be your only choice if the population 
is not well defined. 


Note: 

Randomness and ethics 

When we are performing a study, we cannot force people to be part of it. 
People have a right to say no and as researchers we need to seek informed 
consent. That is, the participants should know what they are being asked to 
do, how their information will be kept secure, if there are any risks to 
participation (and if so what they are), and how to see the results of the 
study. As such, people can choose not to participate in a study. 

Thus, no study involving humans is ever completely random or 
completely representative. Our goal when implementing sampling 
techniques is to minimize any bias that may come into the study because of 
this. 


Convenience sampling 

A type of sampling that is non-random is convenience sampling. 
Convenience sampling involves using results that are readily available. For 
example, a computer software store conducts a marketing study by 
interviewing potential customers who happen to be in the store browsing 
through the available software. The results of convenience sampling may be 
very good in some cases and highly biased (favour certain outcomes) in 
others. This is not a valid sampling technique when it comes to statistical 
inference. That is, if the data is collected using a convenience sample, then 
no conclusions can be made about the population from the sample. 


With replacement or without replacement 

True random sampling is done with replacement. That is, once a member is 
picked, that member goes back into the population and thus may be chosen 
more than once. However, for practical reasons, in most populations, simple 
random sampling is done without replacement. Surveys are typically done 
without replacement. That is, a member of the population may be chosen 
only once. Most samples are taken from large populations and the sample 
tends to be small in comparison to the population. Since this is the case, 
sampling without replacement is approximately the same as sampling with 
replacement because the chance of picking the same individual more than 
once with replacement is very low. 


To illustrate how small the chance is, consider a university with a 
population of 10,000 people. Suppose you want to pick a sample of 1,000 
randomly for a survey. For any particular sample of 1,000, if you are 
sampling with replacement, 


• the chance of picking the first person is 1,000 out of 10,000 (0.1000); 
• the chance of picking a different second person for this sample is 999 
out of 10,000 (0.0999); 
• the chance of picking the same person again is 1 out of 10,000 (very 
low). 


If you are sampling without replacement, 


• the chance of picking the first person for any particular sample is 1,000 
out of 10,000 (0.1000); 
• the chance of picking a different second person is 999 out of 9,999 
(0.0999); 
• you do not replace the first person before picking the next person. 


Compare the fractions 999/10,000 and 999/9,999. For accuracy, carry the 
decimal answers to four decimal places. To four decimal places, these 
numbers are equivalent (0.0999). 


Sampling without replacement instead of sampling with replacement 
becomes a mathematical issue only when the population is small. For 
example, if the population is 25 people, the sample is ten, and you are 
sampling with replacement for any particular sample, then the chance of 
picking the first person is ten out of 25, and the chance of picking a 
different second person is nine out of 25 (you replace the first person). 


If you sample without replacement, then the chance of picking the first 
person is ten out of 25, and then the chance of picking the second person 
(who is different) is nine out of 24 (you do not replace the first person). 


Compare the fractions 9/25 and 9/24. To four decimal places, 9/25 = 0.3600 
and 9/24 = 0.3750. To four decimal places, these numbers are not 
equivalent. 
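The two comparisons above can be reproduced with a few lines of Python. The function name `second_pick` is ours; it simply evaluates the two fractions from the text for a given sample size and population size and rounds them to four decimal places.

```python
def second_pick(sample_size, population_size):
    """Chance of picking a different second person for a particular sample.

    Returns (with replacement, without replacement), rounded to 4 places.
    """
    with_replacement = (sample_size - 1) / population_size
    without_replacement = (sample_size - 1) / (population_size - 1)
    return round(with_replacement, 4), round(without_replacement, 4)


large = second_pick(1000, 10000)    # 999/10,000 versus 999/9,999
small = second_pick(10, 25)         # 9/25 versus 9/24

print(f"Large population: {large[0]:.4f} vs {large[1]:.4f}")
print(f"Small population: {small[0]:.4f} vs {small[1]:.4f}")
```

For the large population the two values agree to four decimal places (0.0999), while for the small population they do not (0.3600 versus 0.3750).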


Example: 
Exercise: 


Problem: 


A study is done to determine the average tuition that San Jose State 
undergraduate students pay per semester. Each student in the 
following samples is asked how much tuition he or she paid for the 
Fall semester. What is the type of sampling in each case? 


a. A sample of 100 undergraduate San Jose State students is taken 
by organizing the students’ names by classification (freshman, 
sophomore, junior, or senior), and then selecting 25 students 
from each. 

b. A random number generator is used to select a student from the 
alphabetical listing of all undergraduate students in the Fall 
semester. Starting with that student, every 50th student is chosen 
until 75 students are included in the sample. 

c. A completely random method is used to select 75 students. Each 
undergraduate student in the fall semester has the same 
probability of being chosen at any stage of the sampling process. 

d. The freshman, sophomore, junior, and senior years are numbered 
one, two, three, and four, respectively. A random number 
generator is used to pick two of those years. All students in those 
two years are in the sample. 

e. An administrative assistant is asked to stand in front of the 
library one Wednesday and to ask the first 100 undergraduate 
students he encounters what they paid for tuition the Fall 
semester. Those 100 students are the sample. 


Solution: 


a. stratified; b. systematic; c. simple random; d. cluster; e. 
convenience 


Example: 
Exercise: 


Problem: 


Determine the type of sampling used (simple random, stratified, 
systematic, cluster, or convenience). 


a. A soccer coach selects six players from a group of boys aged 
eight to ten, seven players from a group of boys aged 11 to 12, 
and three players from a group of boys aged 13 to 14 to form a 
recreational soccer team. 

b. A pollster interviews all human resource personnel in five 
different high tech companies. 

c. A high school educational researcher interviews 50 high school 
female teachers and 50 high school male teachers. 

d. A medical researcher interviews every third cancer patient from 
a list of cancer patients at a local hospital. 

e. A high school counselor uses a computer to generate 50 random 
numbers and then picks students whose names correspond to the 
numbers. 

f. A student interviews classmates in his algebra class to determine 
how many pairs of jeans a student owns, on the average. 


Solution: 


a. stratified; b. cluster; c. stratified; d. systematic; e. simple random; 
f. convenience 


Note: 


Try It 
Exercise: 


Problem: 


Determine the type of sampling used (simple random, stratified, 
systematic, cluster, or convenience). 


A high school principal polls 50 freshmen, 50 sophomores, 50 juniors, 
and 50 seniors regarding policy changes for after school activities. 


Solution: 


stratified 


If we were to examine two samples representing the same population, even 
if we used random sampling methods for the samples, they would not be 
exactly the same. Just as there is variation in data, there is variation in 
samples. As you become accustomed to sampling, the variability will begin 
to seem natural. 


Example: 

Suppose ABC College has 10,000 part-time students (the population). We 
are interested in the average amount of money a part-time student spends 
on books in the fall term. Asking all 10,000 students is an almost 
impossible task. 

Suppose we take two different samples. 

First, we use convenience sampling and survey ten students from a first 
term organic chemistry class. Many of these students are taking first term 
calculus in addition to the organic chemistry class. The amount of money 
they spend on books is as follows: 

$128 $87 $173 $116 $130 $204 $147 $189 $93 $153 

The second sample is taken using a list of senior citizens who take P.E. 
classes and taking every fifth senior citizen on the list, for a total of ten 
senior citizens. They spend: 

$50 $40 $36 $15 $50 $100 $40 $53 $22 $22 

It is unlikely that any student is in both samples. 
Exercise: 


Problem: 


a. Do you think that either of these samples is representative of (or is 
characteristic of) the entire 10,000 part-time student population? 


Solution: 


a. No. The first sample probably consists of science-oriented students. 
Besides the chemistry course, some of them are also taking first-term 
calculus. Books for these classes tend to be expensive. Most of these 
students are, more than likely, paying more than the average part-time 
student for their books. The second sample is a group of senior 
citizens who are, more than likely, taking courses for health and 
interest. The amount of money they spend on books is probably much 
less than the average part-time student. Both samples are biased. Also, 
in both cases, not all students have a chance to be in either sample. 


Exercise: 


Problem: 


b. Since these samples are not representative of the entire population, 
is it wise to use the results to describe the entire population? 


Solution: 


b. No. For these samples, each member of the population did not have 
an equally likely chance of being chosen. 


Now, suppose we take a third sample. We choose ten different part-time 
students from the disciplines of chemistry, math, English, psychology, 
sociology, history, nursing, physical education, art, and early childhood 
development. (We assume that these are the only disciplines in which 
part-time students at ABC College are enrolled and that an equal number of 
part-time students are enrolled in each of the disciplines.) Each student is 
chosen using simple random sampling. Using a calculator, random 
numbers are generated and a student from a particular discipline is selected 
if he or she has a corresponding number. The students spend the following 
amounts: 

$180 $50 $150 $85 $260 $75 $180 $200 $200 $150 

Exercise: 


Problem: c. Is the sample biased? 


Solution: 


c. The sample is unbiased, but a larger sample would be 
recommended to increase the likelihood that the sample will be close 
to representative of the population. However, for a biased sampling 
technique, even a large sample runs the risk of not being 
representative of the population. 
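To see how much the three samples in this example differ, the short Python sketch below computes the average amount spent in each one. The dollar amounts are copied from the example; the variable names are ours. The true population average is not known, so the code only compares the samples with each other.

```python
from statistics import mean

# Amounts spent on books (in dollars), copied from the example.
chemistry_class = [128, 87, 173, 116, 130, 204, 147, 189, 93, 153]
senior_pe_list = [50, 40, 36, 15, 50, 100, 40, 53, 22, 22]
by_discipline = [180, 50, 150, 85, 260, 75, 180, 200, 200, 150]

for name, amounts in [
    ("Convenience sample (organic chemistry class)", chemistry_class),
    ("Every fifth senior citizen in P.E. classes", senior_pe_list),
    ("Random selection from each discipline", by_discipline),
]:
    print(f"{name}: ${mean(amounts):.2f}")
```

The averages are $142.00, $42.80, and $153.00. The large gap between the first two samples reflects the different groups of students each sample was drawn from.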


Students often ask if it is "good enough" to take a sample, instead of 
surveying the entire population. If the survey is done well, the answer is 
yes. 


Note: 
Try It 
Exercise: 


Problem: 


A local radio station has a fan base of 20,000 listeners. The station 
wants to know if its audience would prefer more music or more talk 
shows. Asking all 20,000 listeners is an almost impossible task. 


The station uses convenience sampling and surveys the first 200 
people they meet at one of the station’s music concert events. 24 
people said they’d prefer more talk shows, and 176 people said they’d 
prefer more music. 


Do you think that this sample is representative of (or is characteristic 
of) the entire 20,000 listener population? 


Solution: 
Try It Solutions 


The sample probably consists more of people who prefer music 
because it is a concert event. Also, the sample represents only those 
who showed up to the event earlier than the majority. The sample 
probably doesn’t represent the entire fan base and is probably biased 
towards people who would prefer music. 


Variation in Data 


Variation is present in any set of data. For example, 16-ounce cans of 
beverage may contain more or less than 16 ounces of liquid. In one study, 
eight 16-ounce cans were measured and produced the following amounts (in 
ounces) of beverage: 


15.8 16.1 15.2 14.8 15.8 15.9 16.0 15.5 


Measurements of the amount of beverage in a 16-ounce can may vary 
because different people make the measurements or because the exact 
amount, 16 ounces of liquid, was not put into the cans. Manufacturers 
regularly run tests to determine if the amount of beverage in a 16-ounce can 
falls within the desired range. 
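As a quick illustration, the Python sketch below summarizes the eight measurements listed above. The choice of summaries (average, smallest, largest, and the difference between them) is ours; measures of variation are covered in more detail later in the course.

```python
from statistics import mean

# Amount of beverage (in ounces) in eight 16-ounce cans, from the text.
ounces = [15.8, 16.1, 15.2, 14.8, 15.8, 15.9, 16.0, 15.5]

average = mean(ounces)
smallest, largest = min(ounces), max(ounces)
spread = round(largest - smallest, 1)       # round away floating-point noise
below_label = sum(1 for x in ounces if x < 16)

print(f"average = {average:.4f} oz")
print(f"smallest = {smallest} oz, largest = {largest} oz, spread = {spread} oz")
print(f"{below_label} of {len(ounces)} cans contain less than 16 oz")
```

The cans average about 15.64 ounces, range from 14.8 to 16.1 ounces, and six of the eight contain less than the 16 ounces on the label.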


Be aware that as you take data, your data may vary somewhat from the data 
someone else is taking for the same purpose. This is completely natural. 
However, if two or more of you are taking the same data and get very 
different results, it is time for you and the others to reevaluate your 
data-taking methods and your accuracy. 


Variation in Samples 


It was mentioned previously that two or more samples from the same 
population, taken randomly, and having close to the same characteristics as 
the population will likely be different from each other. Suppose Doreen and 
Jung both decide to study the average amount of time students at their 
college sleep each night. Doreen and Jung each take samples of 500 
students. Doreen uses systematic sampling and Jung uses cluster sampling. 
Doreen's sample will be different from Jung's sample. Even if Doreen and 
Jung used the same sampling method, in all likelihood their samples would 
be different. Neither would be wrong, however. 


Think about what contributes to making Doreen’s and Jung’s samples 
different. 


If Doreen and Jung took larger samples (i.e. the number of data values is 
increased), their sample results (the average amount of time a student 
sleeps) might be closer to the actual population average. But still, their 
samples would be, in all likelihood, different from each other. This 
variability in samples cannot be stressed enough. 
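The simulation below is a rough sketch of this idea. Everything about the population is an assumption made for illustration: 10,000 students whose nightly sleep is generated around 7 hours. For simplicity, both Doreen and Jung take simple random samples here, rather than the systematic and cluster samples mentioned above.

```python
import random
from statistics import mean

rng = random.Random(1)                      # fixed seed so reruns match

# Hypothetical population (assumption): 10,000 students, about 7 hours each.
population = [rng.gauss(7.0, 1.2) for _ in range(10_000)]
population_mean = mean(population)


def sample_mean(n):
    """Average sleep time in a simple random sample of n students."""
    return mean(rng.sample(population, n))


def typical_error(n, repeats=200):
    """Average distance between a sample mean and the population mean."""
    return mean(abs(sample_mean(n) - population_mean) for _ in range(repeats))


doreen = sample_mean(500)
jung = sample_mean(500)
print(f"population: {population_mean:.3f}  Doreen: {doreen:.3f}  Jung: {jung:.3f}")

for n in (50, 500, 2000):
    print(f"n = {n:4d}: typical error about {typical_error(n):.3f} hours")
```

The two sample averages differ from each other and from the population average, and the typical error shrinks as the sample size grows, but it never disappears.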


Size of a Sample 


The size of a sample (often called the number of observations) is important. 
The examples you have seen in this book so far have been small. Samples 
of only a few hundred observations, or even smaller, are sufficient for many 
purposes. In polling, samples that are from 1,200 to 1,500 observations are 
considered large enough and good enough if the survey is random and is 
well done. You will learn why when you study confidence intervals. 


Be aware that many large samples are biased. For example, call-in surveys 
are invariably biased, because people choose to respond or not. 


Critical Evaluation 


We need to evaluate the statistical studies we read about critically and 
analyze them before accepting the results of the studies. We listed common 
problems with sampling techniques above. We reiterate them here and add 
a few additional ones. 


• Problems with samples: A sample must be representative of the 
population. A sample that is not representative of the population is 
biased. Biased samples that are not representative of the population 
give results that are inaccurate and not valid. 

• Self-selected samples: Responses only by people who choose to 
respond, such as call-in surveys, are often unreliable. 

• Sample size issues: Samples that are too small may be unreliable. 
Larger samples are better, if possible. In some situations, small 
samples are unavoidable and can still be used to draw conclusions. 
Examples: crash testing cars or medical testing for rare conditions. 

• Undue influence: Collecting data or asking questions in a way that 
influences the response. 

• Non-response or refusal of subject to participate: The collected 
responses may no longer be representative of the population. Often, 
people with strong positive or negative opinions may answer surveys, 
which can affect the results. 

• Causality: A relationship between two variables does not mean that 
one causes the other to occur. They may be related (correlated) 
because of their relationship through a different variable. 

• Self-funded or self-interest studies: A study performed by a person or 
organization in order to support their claim. Is the study impartial? 
Read the study carefully to evaluate the work. Do not automatically 
assume that the study is good, but do not automatically assume the 
study is bad either. Evaluate it on its merits and the work done. 

• Misleading use of data: Improperly displayed graphs, incomplete data, 
or lack of context. 

• Confounding: When the effects of multiple factors on a response 
cannot be separated. Confounding makes it difficult or impossible to 
draw valid conclusions about the effect of each factor. 
Chapter Review 


Data are individual items of information that come from a population or
sample. Data may be classified as categorical nominal, categorical ordinal,
quantitative continuous, or quantitative discrete.


Because it is not practical to measure the entire population in a study, 
researchers use samples to represent the population. A random sample is a 
representative group from the population chosen by using a method that 
gives each individual in the population an equal chance of being included in 
the sample. Random sampling methods include simple random sampling, 
stratified sampling, cluster sampling, and systematic sampling. 
Convenience sampling is a nonrandom method of choosing a sample that 
often produces biased data. 


Samples that contain different individuals result in different data. This is 
true even when the samples are well-chosen and representative of the 
population. When properly selected, larger samples model the population 
more closely than smaller samples. There are many different potential 
problems that can affect the reliability of a sample. Statistical data needs to 
be critically analyzed, not simply accepted. 


HOMEWORK 
For the following exercises, identify the type of data that would be used to
describe a response (quantitative discrete, quantitative continuous, or
categorical), and give an example of the data.
Exercise: 


Problem: number of tickets sold to a concert 
Solution: 


quantitative discrete, 150 


Exercise: 


Problem: percent of body fat 


Solution: 


quantitative continuous, 19.2% 


Exercise: 


Problem: favorite baseball team 


Solution: 


categorical, Oakland A’s 


Exercise: 


Problem: time in line to buy groceries 


Solution: 


quantitative continuous, 7.2 minutes 


Exercise: 


Problem: number of students enrolled at Evergreen Valley College 


Solution: 


quantitative discrete, 11,234 students 


Exercise: 


Problem: most-watched television show 


Solution: 


categorical, Dancing with the Stars 


Exercise: 


Problem: brand of toothpaste 


Solution: 


categorical, Crest 


Exercise: 


Problem: distance to the closest movie theatre 


Solution: 


quantitative continuous, 8.32 miles 


Exercise: 


Problem: age of executives in Fortune 500 companies 


Solution: 


quantitative continuous, 47.3 years 


Use the following information to answer the next two exercises: A study 
was done to determine the age, number of times per week, and the duration 
(amount of time) of resident use of a local park in Vancouver. The first 
house in the neighbourhood around the park was selected randomly and 
then every 8th house in the neighbourhood around the park was 
interviewed. 

Exercise: 


Problem: “Number of times per week” is what type of data? 


a. nominal categorical ordinal 
b. quantitative discrete 

c. quantitative continuous 

d. categorical nominal 

e. categorical ordinal 


Solution: 


b 


Exercise: 


Problem: “Duration (amount of time)” is what type of data? 


a. Categorical discrete 

b. quantitative discrete 

c. quantitative continuous 
d. categorical nominal 

e. categorical ordinal 


Solution: 


c 
Exercise: 


Problem: 


Airline companies are interested in the consistency of the number of 
babies on each flight, so that they have adequate safety equipment. 
Suppose an airline conducts a survey. Over Thanksgiving weekend, it 
surveys six flights from Montreal to Halifax to determine the number 
of babies on the flights. It determines the amount of safety equipment 
needed by the result of that study. 


a. Using complete sentences, list three things wrong with the way 
the survey was conducted. 

b. Using complete sentences, list three ways that you would improve 
the survey if it were to be repeated. 


Solution: 


a. The survey was conducted using six similar flights. 
The survey would not be a true representation of the entire 
population of air travelers. 
Conducting the survey on a holiday weekend will not produce 
representative results. 


b. Conduct the survey during different times of the year. 
Conduct the survey using flights to and from various locations. 
Conduct the survey on different days of the week. 


Exercise: 


Problem: 


Suppose you want to determine the mean number of cans of soda 
drunk each month by students in their twenties at your school. 
Describe a possible sampling method in three to five complete 
sentences. Make the description detailed. 


Solution: 


Answers will vary. Sample Answer: You could use a systematic 
sampling method. Stop the tenth person as they leave one of the 
buildings on campus at 9:50 in the morning. Then stop the tenth person 
as they leave a different building on campus at 1:50 in the afternoon. 


Exercise: 


Problem: 
Name the sampling method used in each of the following situations: 


a. A woman in the airport is handing out questionnaires to travelers 
asking them to evaluate the airport’s service. She does not ask 
travelers who are hurrying through the airport with their hands 
full of luggage, but instead asks all travelers who are sitting near 
gates and not taking naps while they wait. 

b. A teacher wants to know if her students are doing homework, so 
she randomly selects rows two and five and then calls on all 
students in row two and all students in row five to present the 
solutions to homework problems to the class. 

c. The marketing manager for an electronics chain store wants
information about the ages of its customers. Over the next two
weeks, at each store location, 100 randomly selected customers
are given questionnaires to fill out asking for information about
age, as well as about other variables of interest.

d. The librarian at a public library wants to determine what 
proportion of the library users are children. The librarian has a 
tally sheet on which she marks whether books are checked out by 
an adult or a child. She records this data for every fourth patron 
who checks out books. 

e. A political party wants to know the reaction of voters to a debate 
between the candidates. The day after the debate, the party’s 
polling staff calls 1,200 randomly selected phone numbers. If a 
registered voter answers the phone or is available to come to the 
phone, that registered voter is asked whom he or she intends to 
vote for and whether the debate changed his or her opinion of the 
candidates. 


Solution: 


a. convenience
b. cluster
c. stratified
d. systematic
e. simple random
Exercise: 


Problem: 


In advance of the 1936 Presidential Election, a magazine titled Literary 
Digest released the results of an opinion poll predicting that the 
Republican candidate Alf Landon would win by a large margin. The
magazine sent postcards to approximately 10,000,000 prospective
voters. These prospective voters were selected from the subscription 
list of the magazine, from automobile registration lists, from phone 
lists, and from club membership lists. Approximately 2,300,000 people 
returned the postcards. 


a. Think about the state of the United States in 1936. Explain why a 
sample chosen from magazine subscription lists, automobile 
registration lists, phone books, and club membership lists was not 
representative of the population of the United States at that time. 

b. What effect does the low response rate have on the reliability of 
the sample? 


c. Are these problems examples of sampling error or nonsampling 
error? 

d. During the same year, George Gallup conducted his own poll of 
30,000 prospective voters. His researchers used a method they 
called "quota sampling" to obtain survey answers from specific 
subsets of the population. Quota sampling is an example of which 
sampling method described in this module? 


Solution: 


a. The country was in the middle of the Great Depression and many
people could not afford these “luxury” items and were therefore not
able to be included in the survey.

b. With a low response rate, the people who returned the postcards may
not be representative of everyone who was sent one, which can bias
the results.

c. nonsampling error

d. stratified 


Exercise: 


Problem: 


YouPolls is a website that allows anyone to create and respond to polls. 
One question posted April 15 asks: 


“Do you feel happy paying your taxes when members of the Obama 
administration are allowed to ignore their tax liabilities?” [footnote] 
lastbaldeagle. 2013. On Tax Day, House to Call for Firing Federal 
Workers Who Owe Back Taxes. Opinion poll posted online at: 
http://www.youpolls.com/details.aspx?id=12328 (accessed May 1, 
2013). 


As of April 25, 11 people responded to this question. Each participant 
answered “NO!” 


Which of the potential problems with samples discussed in this module 
could explain this connection? 


Solution: 


Self-Selected Samples: Only people who are interested in the topic are
choosing to respond. Sample Size Issues: A sample with only 11
participants will not accurately represent the opinions of a nation.
Undue Influence: The question is worded in a specific way to
generate a specific response. Self-Funded or Self-Interest Studies: This
question was generated to support one person’s claim and it was
designed to get the answer that the person desires.


Glossary 


Cluster Sampling 
a method for selecting a random sample and dividing the population 
into groups (clusters); use simple random sampling to select a set of 
clusters. Every individual in the chosen clusters is included in the 
sample. 


Continuous Random Variable 
a random variable (RV) whose outcomes are measured; the height of 
trees in the forest is a continuous RV. 


Convenience Sampling 
a nonrandom method of selecting a sample; this method selects 
individuals that are easily accessible and may result in biased data. 


Discrete Random Variable 
a random variable (RV) whose outcomes are counted 


Nonsampling Error 
an issue that affects the reliability of sampling data other than natural 
variation; it includes a variety of human errors including poor study 
design, biased sampling methods, inaccurate information provided by 
study participants, data entry errors, and poor analysis. 


Qualitative Data
See Data.


Quantitative Data 
See Data. 


Random Sampling 
a method of selecting a sample that gives every member of the 
population an equal chance of being selected. 


Sampling Bias 
not all members of the population are equally likely to be selected 


Sampling Error 
the natural variation that results from selecting a sample to represent a 
larger population; this variation decreases as the sample size increases, 
so selecting larger samples reduces sampling error. 


Sampling with Replacement 
Once a member of the population is selected for inclusion in a sample, 
that member is returned to the population for the selection of the next 
individual. 


Sampling without Replacement 
A member of the population may be chosen for inclusion in a sample 
only once. If chosen, the member is not returned to the population 
before the next selection. 


Simple Random Sampling 
a straightforward method for selecting a random sample; give each 
member of the population a number. Use a random number generator 
to select a set of labels. These randomly selected labels identify the 
members of your sample. 


Stratified Sampling 
a method for selecting a random sample used to ensure that subgroups 
of the population are represented adequately; divide the population 
into groups (strata). Use simple random sampling to identify a 
proportionate number of individuals from each stratum. 


Systematic Sampling 
a method for selecting a random sample; list the members of the 
population. Use simple random sampling to select a starting point in 
the population. Let k = (number of individuals in the 
population)/(number of individuals needed in the sample). Choose 
every kth individual in the list starting with the one that was randomly 
selected. If necessary, return to the beginning of the population list to 
complete your sample. 
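
The four random sampling methods defined in this glossary can be sketched in a few lines of code. This is only an illustration under stated assumptions: the population of 120 labelled members, the group labels, and the function names (`simple_random`, `systematic`, `stratified`, `cluster`) are all hypothetical.

```python
import random

rng = random.Random(0)

# Hypothetical population: 120 members, each tagged with one of four groups.
population = [(member_id, f"group{1 + member_id % 4}") for member_id in range(120)]


def simple_random(pop, n):
    """Give every member an equal chance; draw n without replacement."""
    return rng.sample(pop, n)


def systematic(pop, n):
    """Pick a random starting point, then take every kth member, where
    k = (population size) / (sample size). Wrap around the list if needed."""
    k = len(pop) // n
    start = rng.randrange(len(pop))
    return [pop[(start + i * k) % len(pop)] for i in range(n)]


def stratified(pop, n, key):
    """Divide the population into strata, then draw a proportionate
    simple random sample from each stratum."""
    strata = {}
    for member in pop:
        strata.setdefault(key(member), []).append(member)
    sample = []
    for group in strata.values():
        share = round(n * len(group) / len(pop))
        sample.extend(rng.sample(group, share))
    return sample


def cluster(pop, n_clusters, key):
    """Divide the population into clusters, randomly choose some clusters,
    and include every member of the chosen clusters."""
    clusters = {}
    for member in pop:
        clusters.setdefault(key(member), []).append(member)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]


def by_group(member):
    return member[1]


srs = simple_random(population, 12)
sys_sample = systematic(population, 12)
strat_sample = stratified(population, 12, by_group)
cluster_sample = cluster(population, 2, by_group)
```

Note the difference between the last two: the stratified sample takes a few members from every group, while the cluster sample takes every member from a few groups.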


Experimental Design and Ethics - Optional section 


Does aspirin reduce the risk of heart attacks? Is one brand of fertilizer more 
effective at growing roses than another? Is fatigue as dangerous to a driver 
as the influence of alcohol? Questions like these are answered using 
randomized experiments. In this module, you will learn important aspects 
of experimental design. Proper study design ensures the production of 
reliable, accurate data. 


The purpose of an experiment is to investigate the relationship between two 
variables. When one variable causes change in another, we call the first 
variable the independent variable or explanatory variable. The affected 
variable is called the dependent variable or response variable. In a 
randomized experiment, the researcher manipulates values of the 
explanatory variable and measures the resulting changes in the response 
variable. The different values of the explanatory variable are called 
treatments. An experimental unit is a single object or individual to be 
measured. 


You want to investigate the effectiveness of vitamin E in preventing 
disease. You recruit a group of subjects and ask them if they regularly take 
vitamin E. You notice that the subjects who take vitamin E exhibit better 
health on average than those who do not. Does this prove that vitamin E is 
effective in disease prevention? It does not. There are many differences 
between the two groups compared in addition to vitamin E consumption. 
People who take vitamin E regularly often take other steps to improve their 
health: exercise, diet, other vitamin supplements, choosing not to smoke. 
Any one of these factors could be influencing health. As described, this 
study does not prove that vitamin E is the key to disease prevention. 


Additional variables that can cloud a study are called lurking variables. In 
order to prove that the explanatory variable is causing a change in the 
response variable, it is necessary to isolate the explanatory variable. The 
researcher must design her experiment in such a way that there is only one 
difference between groups being compared: the planned treatments. This is 
accomplished by the random assignment of experimental units to 
treatment groups. When subjects are assigned treatments randomly, all of 
the potential lurking variables are spread equally among the groups. At this
point the only difference between groups is the one imposed by the
researcher. Different outcomes measured in the response variable, therefore, 
must be a direct result of the different treatments. In this way, an 
experiment can prove a cause-and-effect connection between the 
explanatory and response variables. 
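
Random assignment itself is simple to carry out. The sketch below is one possible way to do it, assuming twenty hypothetical subjects and two hypothetical treatments; the function name `random_assignment` and the labels are illustrative only.

```python
import random


def random_assignment(units, treatments, seed=None):
    """Shuffle the experimental units, then deal them out to the
    treatment groups in turn so the group sizes stay balanced."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    groups = {treatment: [] for treatment in treatments}
    for position, unit in enumerate(shuffled):
        groups[treatments[position % len(treatments)]].append(unit)
    return groups


# Hypothetical experimental units and treatment labels.
subjects = [f"subject{i:02d}" for i in range(1, 21)]
groups = random_assignment(subjects, ["treatment A", "treatment B"], seed=3)

for treatment, members in groups.items():
    print(treatment, len(members), members[:3], "...")
```

Because chance alone decides who lands in which group, no characteristic of the subjects can systematically favour one treatment.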


The power of suggestion can have an important influence on the outcome of 
an experiment. Studies have shown that the expectation of the study 
participant can be as important as the actual medication. In one study of 
performance-enhancing drugs, researchers noted: 


Results showed that believing one had taken the substance resulted in 
[performance] times almost as fast as those associated with consuming the 
drug itself. In contrast, taking the drug without knowledge yielded no 
significant performance increment.[footnote] 

McClung, M. Collins, D. “Because I know it will!”: placebo effects of an 
ergogenic aid on athletic performance. Journal of Sport & Exercise 
Psychology. 2007 Jun. 29(3):382-94. Web. April 30, 2013. 


When participation in a study prompts a physical response from a 
participant, it is difficult to isolate the effects of the explanatory variable. To 
counter the power of suggestion, researchers set aside one treatment group 
as a control group. This group is given a placebo treatment—a treatment 
that cannot influence the response variable. The control group helps 
researchers balance the effects of being in an experiment with the effects of 
the active treatments. Of course, if you are participating in a study and you 
know that you are receiving a pill which contains no actual medication, then 
the power of suggestion is no longer a factor. Blinding in a randomized 
experiment preserves the power of suggestion. When a person involved in a 
research study is blinded, he does not know who is receiving the active 
treatment(s) and who is receiving the placebo treatment. A double-blind 
experiment is one in which both the subjects and the researchers involved 
with the subjects are blinded. 


Example: 
Exercise: 


Problem: 


The Smell & Taste Treatment and Research Foundation conducted a 
study to investigate whether smell can affect learning. Subjects 
completed mazes multiple times while wearing masks. They 
completed the pencil and paper mazes three times wearing floral-scented
masks, and three times with unscented masks. Participants
were assigned at random to wear the floral mask during the first three 
trials or during the last three trials. For each trial, researchers recorded 
the time it took to complete the maze and the subject’s impression of 
the mask’s scent: positive, negative, or neutral. 


a. Describe the explanatory and response variables in this study. 

b. What are the treatments? 

c. Identify any lurking variables that could interfere with this study. 
d. Is it possible to use blinding in this study? 


Solution: 


a. The explanatory variable is scent, and the response variable is 
the time it takes to complete the maze. 

b. There are two treatments: a floral-scented mask and an unscented 
mask. 

c. All subjects experienced both treatments. The order of treatments 
was randomly assigned so there were no differences between the 
treatment groups. Random assignment eliminates the problem of 
lurking variables. 

d. Subjects will clearly know whether they can smell flowers or 
not, so subjects cannot be blinded in this study. Researchers 
timing the mazes can be blinded, though. The researcher who is 
observing a subject will not know which mask is being worn. 
Chapter Review 


A poorly designed study will not produce reliable data. There are certain 
key components that must be included in every experiment. To eliminate 
lurking variables, subjects must be assigned randomly to different treatment 
groups. One of the groups must act as a control group, demonstrating what 
happens when the active treatment is not applied. Participants in the control 
group receive a placebo treatment that looks exactly like the active 
treatments but cannot influence the response variable. To preserve the 
integrity of the placebo, both researchers and subjects may be blinded. 
When a study is designed properly, the only difference between treatment 
groups is the one imposed by the researcher. Therefore, when groups 
respond differently to different treatments, the difference must be due to the 
influence of the explanatory variable. 


“An ethics problem arises when you are considering an action that benefits
you or some cause you support, hurts or reduces benefits to others, and
violates some rule.”[footnote] Ethical violations in statistics are not always
easy to spot. Professional associations and federal agencies post guidelines
for proper conduct. It is important that you learn basic statistical procedures
so that you can recognize proper data analysis.

Andrew Gelman, “Open Data and Open Methods,” Ethics and Statistics,
http://www.stat.columbia.edu/~gelman/research/published/ChanceEthics1.pdf
(accessed May 1, 2013).


Glossary 


Explanatory Variable 
the independent variable in an experiment; the value controlled by 
researchers 


Treatments 
different values or components of the explanatory variable applied in 
an experiment 


Response Variable 
the dependent variable in an experiment; the value that is measured 
for change at the end of an experiment 


Experimental Unit 
any individual or object to be measured 


Lurking Variable 
a variable that has an effect on a study even though it is neither an 
explanatory variable nor a response variable 


Random Assignment 
the act of organizing experimental units into treatment groups using 
random methods 


Control Group 
a group in a randomized experiment that receives an inactive treatment 
but is otherwise managed exactly as the other groups 


Informed Consent 
Any human subject in a research study must be cognizant of any risks
or costs associated with the study. The subject has the right to know
the nature of the treatments included in the study, their potential risks,
and their potential benefits. Consent must be given freely by an
informed, fit participant.


Institutional Review Board 
a committee tasked with oversight of research programs that involve 
human subjects 


Placebo 
an inactive treatment that has no real effect on the response variable


Blinding 
not telling participants which treatment a subject is receiving 


Double-blinding 
the act of blinding both the subjects of an experiment and the 
researchers who work with the subjects 
When you have large amounts of data, you will need to organize it in a way
that makes sense. These ballots from an election are rolled together with
similar ballots to keep them organized. (credit: William Greeson)


Note: 
Chapter objective 
By the end of this chapter, the student should be able to: 


• Display data graphically and interpret graphs: pie charts, bar graphs,
histograms and box plots.

• Recognize, describe, calculate, and interpret measures of location:
quartiles and percentiles.

• Recognize, describe, calculate, and interpret measures of centre:
mean, median and mode.

• Recognize, describe, calculate, and interpret measures of variation:
variance, standard deviation, range, interquartile range and coefficient
of variation.


Once you have collected data, what will you do with it? Data can be 
described and presented in many different formats. For example, suppose 
you are interested in buying a house in a particular area. You may have no 
clue about the house prices, so you might ask your real estate agent to give 
you a sample data set of prices. Looking at all the prices in the sample often 
is overwhelming. A better way might be to look at the median price and the 
variation of prices. The median and variation are just two ways that you 
will learn to describe data. Your agent might also provide you with a graph 
of the data. 


In this chapter, you will study visual and numerical ways to describe and 
display your data. This area of statistics is called Descriptive Statistics. If 
you have collected 200 data values, just looking at them won’t tell anyone 
much about the data. Instead, you want to summarize the raw data in a way 
that you can better understand what’s going on. 


Categorical data is usually summarized using a visual representation like a
pie chart or a bar graph. The numerical summary for categorical data would
be a percentage, fraction or decimal.


For quantitative data, it is a bit more involved. In general, there are three 
components to a good summary of quantitative data: a visual representation, 
a measure of centre, and a measure of variation. 


The visual representation can give you a sense of the centre and variation in
the data, but it is especially useful for determining the shape of the data. Is the data
all clustered together? Are there a bunch of data on one side, but a few on 
the other? Do all of the data values occur with the same frequency? The 
shape describes this. Histograms and box plots are both visual 
representations of quantitative data. 


Measures of centre, also known as averages or measures of central
tendency, provide one or more values that give us a sense of a typical value in the
data set. This doesn’t tell us about a specific member of the population, but 
instead lets us know what the average one is like. Measures of centre we 
will learn about include the mean, median, and mode. 


Though a measure of centre tells us about a typical value in a data set, 
measures of variation tell us how much the data values vary from each 
other. Are they all clumped together? Are they all spread out? Measures of 
variation can tell us how consistent or how volatile the data is. If we are 
analyzing stock prices, the more variation there is then the more volatile 
and risky the investment is. But the rewards may be greater! Measures of 
variation that we will learn about include range, variance, standard 
deviation, interquartile range, and the coefficient of variation. 


When we describe the shape, centre, and variation of the data, we are 
describing the distribution of the data. If we only focus on one aspect of 
the distribution (say the centre), then we miss out on some important 
information, which is why we always want to consider all three aspects 
when summarizing quantitative data. For example, suppose two stock prices 
have the same average price. If we only look at the average, we might think 
they are equivalent. But if one of them has greater variation, then that 
means that one is more volatile and riskier than the other one. 
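
The stock example can be checked with a short calculation. The two price series below are invented for illustration (they are not real market data); they were chosen so that both have the same mean.

```python
import statistics

# Hypothetical daily closing prices for two stocks (illustrative values only).
stock_a = [49, 50, 51, 50, 50, 49, 51]
stock_b = [30, 70, 45, 55, 20, 80, 50]

mean_a = statistics.mean(stock_a)
mean_b = statistics.mean(stock_b)

# Sample standard deviation: a measure of how far prices vary from the mean.
sd_a = statistics.stdev(stock_a)
sd_b = statistics.stdev(stock_b)

print(f"Stock A: mean = {mean_a}, standard deviation = {sd_a:.2f}")
print(f"Stock B: mean = {mean_b}, standard deviation = {sd_b:.2f}")
```

Both means equal 50, so the centre alone cannot tell the stocks apart. The much larger standard deviation of Stock B is what marks it as the more volatile of the two.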


Box plots (or box and whisker diagrams) are a special type of visual 
representation that includes both visual and numerical elements. A box plot 
divides the data into quarters (or quartiles). Thus, a box plot contains a 
measure of centre (the second quartile is the halfway point, called the 
median) and a measure of variation (the distance between the first quartile 
and the third quartile is called the interquartile range). The box plot can also 
give a sense of the data’s shape. The box plot is therefore the only
representation that we will see that conveys the whole distribution at once
(i.e. it gives a sense of centre, variation, and shape). It
also has an additional benefit of identifying outliers. Outliers are data 
values that are abnormal. That is, they differ significantly from the other 
data values. A box plot shows if there are any outliers. 
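
The numerical ingredients of a box plot can be sketched as follows. The data set is hypothetical, and the `method="inclusive"` option of Python's `statistics.quantiles` is only one of several conventions for computing quartiles, so a textbook or calculator method may give slightly different values for the first and third quartiles.

```python
import statistics

# Hypothetical data set of 11 values (illustrative only).
data = [1, 2, 4, 4, 5, 6, 7, 8, 10, 11, 15]

# Cut the data into quarters: first quartile, median, third quartile.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# The interquartile range is the distance between the first and third quartiles.
iqr = q3 - q1

print(f"minimum = {min(data)}, Q1 = {q1}, median = {q2}, Q3 = {q3}, maximum = {max(data)}")
print(f"interquartile range = {iqr}")
```

The second quartile matches the median of the data, which is why a box plot shows a measure of centre and a measure of variation in one picture.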


This chapter will go over descriptive statistics by focusing on visual and 
numerical representations of data. Though categorical data is discussed, the 
main focus will be on determining the distribution and outliers for 
quantitative data. 


The vast majority of the time when conducting statistical studies, we will
only have access to sample data. In this situation, we will want to analyze
the sample data to see if we can come to any conclusions about the
population data. Once we make the leap from simply describing a sample to
using that sample to draw conclusions about the population, we are doing
inferential statistics. These concepts and techniques are covered in chapters
seven and eight.


Note: 

Key Idea 

The distribution of sample data ideally mimics the distribution of the 
population. But the smaller the sample size the greater the potential for 
there to be differences between the two distributions. This means that, for a 
large enough sample size, the distribution of the sample generally gives a
good idea of the distribution of the population. This is an example of the law
of large numbers. In other words, if the sample size is large enough and 
the data is collected properly, then the sample mean will most likely be a 
good estimate of the population mean, the sample standard deviation will 
most likely be a good estimate of the population standard deviation, and 
the shape of the sample data will most likely be a good estimate of the 
shape of the population. 
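
A small simulation can illustrate this key idea. Everything here is an assumption made for the sketch: the population is randomly generated rather than real, and `typical_error` is a hypothetical helper that averages, over many repeated samples, how far the sample mean falls from the population mean.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical right-skewed population of 100,000 values with a mean near 50.
population = [rng.expovariate(1 / 50) for _ in range(100_000)]
pop_mean = statistics.mean(population)


def typical_error(n, repeats=200):
    """Average distance between the sample mean and the population mean,
    taken over many simple random samples of size n."""
    gaps = [
        abs(statistics.mean(rng.sample(population, n)) - pop_mean)
        for _ in range(repeats)
    ]
    return statistics.mean(gaps)


errors = {n: typical_error(n) for n in (10, 100, 1000)}

for n, error in errors.items():
    print(f"sample size {n:>4}: typical gap from population mean = {error:.2f}")
```

The typical gap shrinks as the sample size grows, which is the behaviour the law of large numbers describes.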
Visual representations of categorical data 


Below are tables comparing the number of part-time and full-time students at De 
Anza College and Foothill College enrolled for the spring 2010 quarter. The 
tables display counts (frequencies) and percentages or proportions (relative 
frequencies). The percent columns make comparing the same categories in the 
colleges easier. Displaying percentages along with the numbers is often helpful, 
but it is particularly important when comparing sets of data that do not have the 
same totals, such as the total enrollments for both colleges in this example. 
Notice how much larger the percentage for part-time students at Foothill College 
is compared to De Anza College. 


             De Anza College            Foothill College
             Number     Percent         Number     Percent
Full-time     9,200      40.9%           4,059      28.6%
Part-time    13,296      59.1%          10,124      71.4%
Total        22,496       100%          14,183       100%


Fall Term 2007 (Census day) 
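
The percent columns can be reproduced from the counts. The sketch below uses the counts from the tables above; the variable and function names (`enrollment`, `percentages`, `percent_table`) are hypothetical.

```python
# Counts taken from the enrollment tables above.
enrollment = {
    "De Anza College": {"Full-time": 9200, "Part-time": 13296},
    "Foothill College": {"Full-time": 4059, "Part-time": 10124},
}


def percentages(counts):
    """Relative frequency of each category, as a percent rounded to one decimal."""
    total = sum(counts.values())
    return {category: round(100 * count / total, 1) for category, count in counts.items()}


percent_table = {college: percentages(counts) for college, counts in enrollment.items()}

for college, percents in percent_table.items():
    print(college, percents)
```

Because the two colleges have different total enrollments, the percentages (not the raw counts) are what make the part-time figures directly comparable.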


Tables are a good way of organizing and displaying data. But graphs can be even 
more helpful in understanding the data. There are no strict rules concerning 
which graphs to use. Two graphs that are used to display categorical data are pie 
charts and bar graphs. 


In a pie chart, categories of data are represented by wedges in a circle and are 
proportional in size to the percent of individuals in each category. 


In a bar graph, the length of the bar for each category is proportional to the 
number or percent of individuals in each category. Bars may be vertical or 
horizontal. 


Look at [link] and [link] and determine which graph (pie or bar) you think 
displays the comparisons better. 


It is a good idea to look at a variety of graphs to see which is the most helpful in 
displaying the data. We might make different choices of what we think is the 
“best” graph depending on the data and the context. Our choice also depends on 
what we are using the data for. 


[Figure: pie charts of part-time and full-time enrollment at De Anza College
and Foothill College]

[Figure: bar graph titled "Student Status" comparing full-time and part-time
enrollment at De Anza and Foothill]


Visual Representations of Quantitative Data 


Bar Graphs 


Bar graphs can also be used to summarize discrete quantitative data and 
categorical data. Bar graphs consist of bars that are separated from each other. 
The bars can be rectangles or they can be rectangular boxes (used in three- 
dimensional plots), and they can be vertical or horizontal. The bar graph shown 
in [link] has age groups represented on the x-axis and proportions on the y-axis. 
Exercise: 


Problem: 


By the end of 2011, Facebook had over 146 million users in the United 
States. [link] shows three age groups, the number of users in each age 
group, and the proportion (%) of users in each age group. Construct a bar 
graph using this data. 


Age groups    Number of Facebook users    Proportion (%) of Facebook users
13-25         65,082,280                  45%
26-44         53,300,200                  36%
45-64         27,885,100                  19%


Solution: 


[Figure: bar graph with the age groups 13-25, 26-44, and 45-64 on the x-axis
(Ages) and Proportion (%) on the y-axis]


Note: 
Try it 
Exercise: 


Problem: 


Park City is broken down into six voting districts. The table shows the
percent of the total registered voter population that lives in each district as
well as the percent of the entire population that lives in each district.
Construct a bar graph that shows the registered voter population by district.


District    Registered voter population    Overall city population
1           15.5%                          19.4%
2           12.2%                          15.6%
3           9.8%                           9.0%
4           17.4%                          18.5%
5           22.8%                          20.7%
6           22.3%                          16.8%

Solution:


[Figure: bar graph with District on the x-axis and Voter Proportion (%) on the
y-axis, scaled from 0.0% to 25.0%.]


Frequency tables 


Twenty students were asked how many hours they worked per day. Their
responses, in hours, are as follows: 5; 6; 3; 3; 2; 4; 7; 5; 2; 3; 5; 6; 5; 4;
4; 3; 5; 2; 5; 3.


[link] lists the different data values in ascending order and their frequencies. 


DATA VALUE    FREQUENCY
2             3
3             5
4             3
5             6
6             2
7             1


Frequency Table of Student Work Hours 


A frequency is the number of times a value of the data occurs. According to 
[link], there are three students who work two hours, five students who work three 
hours, and so on. The sum of the values in the frequency column, 20, represents 
the total number of students included in the sample. 


A relative frequency is the ratio (fraction or proportion) of the number of times
a value of the data occurs in the set of all outcomes to the total number of
outcomes. To find the relative frequencies, divide each frequency by the total
number of students in the sample, in this case 20. Relative frequencies can be
written as fractions, percents, or decimals.
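The counting and dividing described above can be sketched in a few lines of Python. This is only an illustration using the 20 responses from the example; the variable names are our own and are not part of the text.

```python
from collections import Counter

# Hours worked per day reported by the 20 students in the example.
hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

# Frequency: the number of times each data value occurs.
frequency = Counter(hours)
n = len(hours)  # total number of data values

# Relative frequency: each frequency divided by the total number of values.
relative_frequency = {value: frequency[value] / n for value in sorted(frequency)}

for value in sorted(frequency):
    print(value, frequency[value], f"{relative_frequency[value]:.2f}")
```

Each printed row gives a data value, its frequency, and its relative frequency as a decimal.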


DATA VALUE    FREQUENCY    RELATIVE FREQUENCY
2             3            3/20 or 0.15
3             5            5/20 or 0.25
4             3            3/20 or 0.15
5             6            6/20 or 0.30
6             2            2/20 or 0.10
7             1            1/20 or 0.05


Frequency Table of Student Work Hours with Relative Frequencies 


The sum of the values in the relative frequency column of [link] is 20/20, or 1.


Cumulative relative frequency is the accumulation of the previous relative
frequencies. To find the cumulative relative frequencies, add all the previous
relative frequencies to the relative frequency for the current row, as shown in
[link].


DATA VALUE    FREQUENCY    RELATIVE FREQUENCY    CUMULATIVE RELATIVE FREQUENCY
2             3            3/20 or 0.15          0.15
3             5            5/20 or 0.25          0.15 + 0.25 = 0.40
4             3            3/20 or 0.15          0.40 + 0.15 = 0.55
5             6            6/20 or 0.30          0.55 + 0.30 = 0.85
6             2            2/20 or 0.10          0.85 + 0.10 = 0.95
7             1            1/20 or 0.05          0.95 + 0.05 = 1.00


Frequency Table of Student Work Hours with Relative and Cumulative Relative 
Frequencies 


The last entry of the cumulative relative frequency column is one, indicating that 
one hundred percent of the data has been accumulated. 
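A running total is all that the cumulative column requires. The following is a minimal sketch, assuming the frequencies from the student work hours table; `itertools.accumulate` is just one convenient way to keep the running sum.

```python
from itertools import accumulate

# Data values and frequencies from the student work hours table.
values = [2, 3, 4, 5, 6, 7]
frequencies = [3, 5, 3, 6, 2, 1]
n = sum(frequencies)  # 20 students in total

# Relative frequency for each row, then the running (cumulative) total.
relative = [f / n for f in frequencies]
cumulative = list(accumulate(relative))

for v, f, r, c in zip(values, frequencies, relative, cumulative):
    print(f"{v}  {f}  {r:.2f}  {c:.2f}")
```

The values are rounded only when printed, so small floating-point differences do not show up in the table.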


Note: 

NOTE 

Because of rounding, the relative frequency column may not always sum to one, 
and the last entry in the cumulative relative frequency column may not be one. 
However, they each should be close to one. 


[link] represents the heights, in inches, of a sample of 100 male semiprofessional 
soccer players. 


HEIGHTS (INCHES)    FREQUENCY      RELATIVE FREQUENCY    CUMULATIVE RELATIVE FREQUENCY
60-61.99            5              5/100 = 0.05          0.05
62-63.99            3              3/100 = 0.03          0.05 + 0.03 = 0.08
64-65.99            15             15/100 = 0.15         0.08 + 0.15 = 0.23
66-67.99            40             40/100 = 0.40         0.23 + 0.40 = 0.63
68-69.99            17             17/100 = 0.17         0.63 + 0.17 = 0.80
70-71.99            12             12/100 = 0.12         0.80 + 0.12 = 0.92
72-73.99            7              7/100 = 0.07          0.92 + 0.07 = 0.99
74-75.99            1              1/100 = 0.01          0.99 + 0.01 = 1.00
                    Total = 100    Total = 1.00


Frequency Table of Soccer Player Height


The data in this table have been grouped into the following intervals: 


60 to 61.99 inches 
62 to 63.99 inches 
64 to 65.99 inches 
66 to 67.99 inches 
68 to 69.99 inches 
70 to 71.99 inches 
72 to 73.99 inches 
74 to 75.99 inches 


In this sample, there are five players whose heights fall within the interval
59.95 to 61.95 inches, three players whose heights fall within the interval
61.95 to 63.95 inches, 15 players whose heights fall within the interval 63.95
to 65.95 inches, 40 players whose heights fall within the interval 65.95 to
67.95 inches, 17 players whose heights fall within the interval 67.95 to 69.95
inches, 12 players whose heights fall within the interval 69.95 to 71.95
inches, seven players whose heights fall within the interval 71.95 to 73.95
inches, and one player whose height falls within the interval 73.95 to 75.95
inches. All heights fall between the endpoints of an interval and not at the
endpoints.


Example:
Exercise:


Problem:
From [link], find the percentage of heights that are less than 65.95 inches.
Solution:


If you look at the first, second, and third rows, the heights are all less than
65.95 inches. There are 5 + 3 + 15 = 23 players whose heights are less than
65.95 inches. The percentage of heights less than 65.95 inches is then 23/100,
or 23%. This percentage is the cumulative relative frequency entry in the
third row.
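The same lookup can be sketched in Python. This is an illustrative helper only: the function name `percent_between` is our own, and it assumes the limits you pass are class boundaries from the table, so that whole classes are counted.

```python
# Class boundaries (lower, upper) and frequencies from the soccer player table.
classes = [
    (59.95, 61.95, 5), (61.95, 63.95, 3), (63.95, 65.95, 15), (65.95, 67.95, 40),
    (67.95, 69.95, 17), (69.95, 71.95, 12), (71.95, 73.95, 7), (73.95, 75.95, 1),
]
total = sum(count for _, _, count in classes)  # 100 players


def percent_between(lower, upper):
    """Percentage of heights in the whole classes lying between two boundaries."""
    count = sum(c for lo, hi, c in classes if lo >= lower and hi <= upper)
    return 100 * count / total


print(percent_between(59.95, 65.95))  # heights less than 65.95 inches
print(percent_between(61.95, 65.95))  # heights between 61.95 and 65.95 inches
```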


Note: 
Try It 
Exercise: 


Problem: 


[link] shows the amount, in inches, of annual rainfall in a sample of towns.


Rainfall (Inches)    Frequency     Relative Frequency    Cumulative Relative Frequency
[illegible]          6             6/50 = 0.12           0.12
5-6.99               7             7/50 = 0.14           0.12 + 0.14 = 0.26
7-9.99               15            15/50 = 0.30          0.26 + 0.30 = 0.56
10-11.99             8             8/50 = 0.16           0.56 + 0.16 = 0.72
12-12.99             9             9/50 = 0.18           0.72 + 0.18 = 0.90
13-14.99             5             5/50 = 0.10           0.90 + 0.10 = 1.00
                     Total = 50    Total = 1.00


From [link], find the percentage of rainfall that is less than 9.99 inches. 


Solution: 


Try It Solutions 


0.56 or 56% 


Example: 
Exercise: 


Problem: 


From [link], find the percentage of heights that fall between 61.95 and
65.95 inches.


Solution: 


Add the relative frequencies in the second and third rows: 0.03 + 0.15 = 
0.18 or 18%. 


Note: 
Try It 
Exercise: 


Problem: 


From [link], find the percentage of rainfall that is between 7.00 and 12.99 
inches. 


Solution: 
Try It Solutions 


0.30 + 0.16 + 0.18 = 0.64 or 64% 


Example: 
Exercise: 


Problem: 


Use the heights of the 100 male semiprofessional soccer players in [link]. 
Fill in the blanks and check your answers. 


a. The percentage of heights that are from 67.95 to 71.95 inches is: ____.

b. The percentage of heights that are from 67.95 to 73.95 inches is: ____.

c. The percentage of heights that are more than 65.95 inches is: ____.

d. The number of players in the sample who are between 61.95 and 71.95
inches tall is: ____.

e. What kind of data are the heights?

f. Describe how you could gather this data (the heights) so that the data
are characteristic of all male semiprofessional soccer players.


Remember, you count frequencies. To find the relative frequency, divide
the frequency by the total number of data values. To find the cumulative
relative frequency, add all of the previous relative frequencies to the
relative frequency for the current row.


Solution: 


a. 29%

b. 36%

c. 77%

d. 87

e. quantitative continuous

f. get rosters from each team and choose a simple random sample from
each


Example: 


Nineteen people were asked how many miles, to the nearest mile, they commute
to work each day. The data are as follows: 2; 5; 7; 3; 2; 10; 18; 15; 20; 7;
10; 18; 5; 12; 13; 12; 4; 5; 10. [link] was produced:


DATA    FREQUENCY    RELATIVE FREQUENCY    CUMULATIVE RELATIVE FREQUENCY
3       3            3/19                  0.1579
4       1            1/19                  0.2105
5       3            3/19                  0.1579
7       2            2/19                  0.2632
10      3            [illegible]           0.4737
12      2            2/19                  0.7895
13      1            1/19                  0.8421
15      1            1/19                  0.8948
18      1            1/19                  0.9474
20      1            1/19                  1.0000


Frequency of Commuting Distances 


Exercise: 


Problem: 


a. Is the table correct? If it is not correct, what is wrong? 

b. True or False: Three percent of the people surveyed commute three 
miles. If the statement is not correct, what should it be? If the table is 
incorrect, make the corrections. 

c. What fraction of the people surveyed commute five or seven miles? 

d. What fraction of the people surveyed commute 12 miles or more? 
Less than 12 miles? Between five and 13 miles (not including five and 
13 miles)? 


Solution: 


a. No. The frequency column sums to 18, not 19. Not all cumulative
relative frequencies are correct.

b. False. The frequency for three miles should be one; for two miles (left
out), two. The cumulative relative frequency column should read:
0.1053, 0.1579, 0.2105, 0.3684, 0.4737, 0.6316, 0.7368, 0.7895,
0.8421, 0.9474, 1.0000.

c. 5/19

d. 7/19, 12/19, 7/19
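The answers to parts c and d can be checked by counting directly from the 19 raw data values instead of from the flawed table. A minimal sketch, assuming the data listed in the example (the helper name `fraction_where` is our own):

```python
from collections import Counter
from fractions import Fraction

# Commuting distances (miles) for the 19 people surveyed.
miles = [2, 5, 7, 3, 2, 10, 18, 15, 20, 7, 10, 18, 5, 12, 13, 12, 4, 5, 10]
n = len(miles)
counts = Counter(miles)  # corrected frequencies, counted from the raw data


def fraction_where(condition):
    """Fraction of the people surveyed whose distance satisfies the condition."""
    return Fraction(sum(1 for m in miles if condition(m)), n)


five_or_seven = fraction_where(lambda m: m in (5, 7))
at_least_12 = fraction_where(lambda m: m >= 12)
less_than_12 = fraction_where(lambda m: m < 12)
between_5_and_13 = fraction_where(lambda m: 5 < m < 13)

print(five_or_seven, at_least_12, less_than_12, between_5_and_13)
```

Using `Fraction` keeps the answers as exact fractions of 19, which is the form the exercise asks for.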
Note: 
Try It 
Exercise: 
Problem: 


[link] represents the amount, in inches, of annual rainfall in a sample of
towns. What fraction of towns surveyed get at least 12 inches of rainfall
each year?


Solution: 
Try It Solutions 


14/50


Histograms 


In the introduction, the idea of distribution was introduced. The distribution 
refers to the shape, centre and variation of quantitative data. To determine the 
shape of the data, we need to look at a visual representation of the data. The best 
visual representation to look at is the histogram. 


Note: Bar graphs and histograms look very similar. They both have bars whose
heights represent the frequency of the data. But bar graphs are used for
categorical data and discrete quantitative data (i.e. whole number data).
Histograms are used for continuous quantitative data (i.e. numbers with
decimals) and sometimes discrete quantitative data as well. Since there is a gap
between categories and whole numbers, the bars in bar graphs do not touch. But
for continuous data, there is no gap between the numbers, so the bars for
histograms do touch.


For most of the work you do in this book, you will use a histogram to display the 
data. One advantage of a histogram is that it can readily display large data sets. 
The following explains how to make a histogram by hand, but you can use 
statistical software to do this quite quickly.


A histogram consists of contiguous (adjoining) boxes. It has both a horizontal 
axis and a vertical axis. The horizontal axis is labeled with what the data 
represents (for instance, distance from your home to school). The vertical axis is 
labeled either frequency or relative frequency (or percent frequency or 
probability). The graph will have the same shape with either label. The histogram 
(like the stemplot) can give you the shape of the data, the center, and the spread 
of the data. 


The relative frequency is equal to the frequency for an observed value of the data
divided by the total number of data values in the sample. (Remember, frequency
is defined as the number of times an answer occurs.) If:


• f = frequency,

• n = total number of data values (or the sum of the individual frequencies),
and

• RF = relative frequency,


then:
Equation:

$RF = \frac{f}{n}$


For example, if three students in a class of 40 received grades from 90% to
100%, then f = 3, n = 40, and RF = f/n = 3/40 = 0.075. That is, 7.5% of the
students received 90-100%. The grades 90-100% are quantitative measures.


To construct a histogram, first decide how many bars or intervals, also called
classes, represent the data. Many histograms consist of five to 15 bars or classes
for clarity. The number of bars needs to be chosen. Choose a starting point for the
first interval to be less than the smallest data value. A convenient starting point
is a lower value carried out to one more decimal place than the value with the
most decimal places. For example, if the value with the most decimal places is
6.1 and this is the smallest value, a convenient starting point is 6.05 (6.1 - 0.05 =
6.05). We say that 6.05 has more precision. If the value with the most decimal
places is 2.23 and the lowest value is 1.5, a convenient starting point is 1.495 (1.5
- 0.005 = 1.495). If the value with the most decimal places is 3.234 and the
lowest value is 1.0, a convenient starting point is 0.9995 (1.0 - 0.0005 = 0.9995).
If all the data happen to be integers and the smallest value is two, then a
convenient starting point is 1.5 (2 - 0.5 = 1.5). Also, when the starting point and
other boundaries are carried to one additional decimal place, no data value will
fall on a boundary. The next two examples go into detail about how to construct a
histogram using continuous data and how to create a histogram using discrete
data.
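The starting-point rule can be written as a small Python function. This is a sketch under one assumption: the data values are passed in as strings, so that the number of decimal places as written (for example "1.0") is preserved. The function name `convenient_start` is hypothetical.

```python
from decimal import Decimal


def convenient_start(values):
    """Smallest value minus half a unit in the last decimal place used by the data.

    `values` is a list of strings such as ["2.23", "1.5"].
    """
    decimals = [Decimal(v) for v in values]
    # Largest number of decimal places found among the data values.
    places = max(max(-d.as_tuple().exponent, 0) for d in decimals)
    # Half a unit one decimal place further out: 0.05 for one decimal place, etc.
    half_unit = Decimal(5).scaleb(-(places + 1))
    return min(decimals) - half_unit


print(convenient_start(["6.1", "7.4"]))     # one decimal place in the data
print(convenient_start(["2.23", "1.5"]))    # two decimal places in the data
print(convenient_start(["3.234", "1.0"]))   # three decimal places in the data
print(convenient_start(["2", "3", "5"]))    # integer data
```

`Decimal` is used instead of floating-point numbers so that subtractions such as 1.5 - 0.005 give exactly 1.495.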


Example: 

The following data are the heights (in inches to the nearest half inch) of 100
male semiprofessional soccer players. The heights are continuous data, since
height is measured.

60; 60.5; 61; 61; 61.5

63.5; 63.5; 63.5

64; 64; 64; 64; 64; 64; 64; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5; 64.5

66; 66; 66; 66; 66; 66; 66; 66; 66; 66; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5; 66.5;
66.5; 66.5; 66.5; 66.5; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67; 67.5;
67.5; 67.5; 67.5; 67.5; 67.5; 67.5

68; 68; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69; 69.5; 69.5; 69.5; 69.5; 69.5

70; 70; 70; 70; 70; 70; 70.5; 70.5; 70.5; 71; 71; 71

72; 72; 72; 72.5; 72.5; 73; 73.5

74

The smallest data value is 60. Since the data with the most decimal places has
one decimal (for instance, 61.5), we want our starting point to have two decimal
places. Since the numbers 0.5, 0.05, 0.005, etc. are convenient numbers, use
0.05 and subtract it from 60, the smallest value, for the convenient starting point.
60 - 0.05 = 59.95, which is more precise than, say, 61.5 by one decimal place.
The starting point is, then, 59.95.

The largest value is 74, so 74 + 0.05 = 74.05 is the ending value.

Next, calculate the width of each bar or class interval. To calculate this width,
subtract the starting point from the ending value and divide by the number of
bars (you must choose the number of bars you desire). Suppose you choose eight
bars.
Equation:

$\frac{74.05 - 59.95}{8} \approx 1.76$


Note: 

NOTE 

We will round up to two and make each bar or class interval two units wide. 
Rounding up to two is one way to prevent a value from falling on a boundary. 
Rounding to the next number is often necessary even if it goes against the 
standard rules of rounding. For this example, using 1.76 as the width would 
also work. A guideline that is followed by some for the width of a bar or class 
interval is to take the square root of the number of data values and then round to 
the nearest whole number, if necessary. For example, if there are 150 values of 
data, take the square root of 150 and round to 12 bars or intervals. 
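The square-root guideline mentioned in the note is easy to sketch in Python. The function name is our own, and the guideline is only one rule of thumb among several.

```python
import math


def suggested_number_of_bars(n_values):
    """Square root of the number of data values, rounded to the nearest whole number."""
    return round(math.sqrt(n_values))


print(suggested_number_of_bars(150))  # the example from the note
```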


The boundaries are:


59.95
59.95 + 2 = 61.95
61.95 + 2 = 63.95
63.95 + 2 = 65.95
65.95 + 2 = 67.95
67.95 + 2 = 69.95
69.95 + 2 = 71.95
71.95 + 2 = 73.95
73.95 + 2 = 75.95


The heights 60 through 61.5 inches are in the interval 59.95 to 61.95. The heights
that are 63.5 are in the interval 61.95 to 63.95. The heights that are 64 through
64.5 are in the interval 63.95 to 65.95. The heights 66 through 67.5 are in the
interval 65.95 to 67.95. The heights 68 through 69.5 are in the interval 67.95 to
69.95. The heights 70 through 71 are in the interval 69.95 to 71.95. The heights 72
through 73.5 are in the interval 71.95 to 73.95. The height 74 is in the interval
73.95 to 75.95.
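The width, the boundaries, and the interval that each height falls into can be sketched as follows. This is a minimal illustration of the steps in this example; the names `class_index` and `boundaries` are our own, and rounding the width up to the next whole number is the choice made in the note above.

```python
import math

smallest, largest = 60, 74
start = smallest - 0.05   # convenient starting point, 59.95
end = largest + 0.05      # convenient ending value, 74.05
number_of_bars = 8

raw_width = (end - start) / number_of_bars   # about 1.76
width = math.ceil(raw_width)                 # round up to 2

# Class boundaries, rounded to two decimals to hide floating-point noise.
boundaries = [round(start + i * width, 2) for i in range(number_of_bars + 1)]


def class_index(height):
    """Zero-based index of the class interval that contains the height."""
    return int((height - start) // width)


print(boundaries)
print(class_index(63.5))  # second interval, 61.95 to 63.95
```

Because the boundaries carry one more decimal place than the data, no height can land exactly on a boundary, so the integer division is unambiguous.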


The following histogram displays the heights on the x-axis and relative
frequency on the y-axis.


[Figure: histogram of the soccer player heights, with Heights (class boundaries
from 59.95 to 75.95) on the x-axis and Relative frequency (0 to 0.4) on the
y-axis.]


Note: 

Titles, labelling and numbering of visual representations 

Visual representations should be numbered. As they are images, they would be 
numbered as figures. For example, a histogram would be numbered “Figure 3”. 
This means it is the third image in the document. This makes it easier to refer 
back to: “In Figure 3, we can see that ...” 

The title of the visual representation includes the name of the visual 
representation and the context: “Histogram of ...”. 

The label that goes along the axis includes the variable and the unit: Variable 
(unit). 

These three aspects combined will make it easy to refer to the image and let the 
reader of the image know what the image is about. 

A frequency table would be similarly titled and labelled, but since it is a table 
and not an image, it would be referred to as “Table 4” (meaning the fourth table 
in the document). 

As you look through this textbook, notice how all of the images and tables are 
numbered as described above. 


Note: 
Try It 


Exercise: 


Problem: 


The following data are the shoe sizes of 50 male students. The sizes are
continuous data since shoe size is measured. Construct a histogram and
calculate the width of each bar or class interval. Suppose you choose six
bars.

[Data values illegible in the source: 50 shoe sizes, the smallest being 9 and
the largest 14.]

Solution: 

Smallest value: 9 

Largest value: 14 

Convenient starting value: 9 — 0.05 = 8.95 


Convenient ending value: 14 + 0.05 = 14.05 


$\frac{14.05 - 8.95}{6} = 0.85$


The calculation suggests using 0.85 as the width of each bar or class
interval. You can also use an interval with a width equal to one.


Shape 


The shape of the data helps us understand what kind of pattern the data has. For 
example, if all of the data values have the same frequency, then the shape will be 
distinct (it is called uniform). If the data has a skew in it, then that helps us 
understand the measure of centre better (to be discussed in the next section). 
Overall, the shape helps us see how the data is behaving. Data that has similar 
shapes will behave in similar ways. 


The shape of the data set is determined by looking at a visual representation of 
the data and usually the histogram. Common ways of describing the shape 
include whether it is symmetrical or not, how many distinct peaks it has 
(unimodal, bimodal, multimodal), and whether the data has a tail only on one 
side (skew). 


• Data is symmetric if the shape is the same on both sides of centre.

• Skewed data has a "tail" on one side. This means that there are some data
values that are far from the centre but only on one side. This is a type of
non-symmetric data.

• For a histogram, the term "modal" refers to the number of distinct peaks.
You almost want to think about mountain peaks. If there are multiple,
distinct mountain peaks, then we say the data is multi-modal. If there is only
one distinct peak, then the data is uni-modal. Not all data has a distinct
peak.

• Uniform data occurs if the frequency of each interval is about the same.
This will result in a flat looking histogram.

• A very important shape in statistics is the bell-curve (the shape in the first
row, second column). This shape is symmetric, uni-modal and looks like a
bell. If data has this shape (and satisfies a few other properties that will be
discussed in Chapter 5), we call this data normal.


Here are some examples of different shapes of data:


[Figure: various shapes that data can have. First row: skewed left; symmetric;
skewed right; uniform. Second row: bimodal skewed left; bimodal skewed right.
Third row: skewed left with outliers; symmetric with outliers; skewed right
with outliers.]


The above is provided to give you some ideas on how to describe the shape of 
data. But not all data sets have a nice shape that fits into one of the above. 
Sometimes the data can only be described as non-symmetric. 


How NOT to Lie with Statistics 


It is important to remember that the very reason we develop a variety of methods
to present data is to develop insights into the subject of what the observations
represent. We want to get a "sense" of the data. Are the observations all very
much alike, or are they spread across a wide range of values? Are they bunched at
one end of the spectrum, or are they distributed evenly? We are trying to
get a visual picture of the numerical data. Shortly we will develop formal
mathematical measures of the data, but our visual graphical presentation can say
much. It can, unfortunately, also say much that is distracting, confusing and
simply wrong in terms of the impression the visual leaves. Many years ago
Darrell Huff wrote the book How to Lie with Statistics. It has been through 25
plus printings and sold more than one and one-half million copies. His
perspective was a harsh one and used many actual examples that were designed
to mislead. He wanted to make people aware of such deception, but perhaps more
importantly to educate so that others do not make the same errors inadvertently.


Again, the goal is to enlighten with visuals that tell the story of the data. Pie
charts have a number of common problems when used to convey the message of
the data. Too many pieces of the pie overwhelm the reader. No more than perhaps
five or six categories ought to be used to give an idea of the relative
importance of each piece. This is, after all, the goal of a pie chart: to show
which subset matters most relative to the others. If there are more components
than this, then perhaps an alternative approach would be better, or perhaps some
can be consolidated into an "other" category. Pie charts cannot show changes
over time, although we see this attempted all too often. In federal, state, and
city finance documents, pie charts are often presented to show the components of
revenue available to the governing body for appropriation: income tax, sales
tax, motor vehicle taxes and so on. In and of itself this is interesting
information and can be nicely done with a pie chart. The error occurs when two
years are set side-by-side. Because the total revenues change year to year, but
the size of the pie is fixed, no real information is provided and the relative
size of each piece of the pie cannot be meaningfully compared.


Histograms can be very helpful in understanding the data. Properly presented, 
they can be a quick visual way to present probabilities of different categories by 
the simple visual of comparing relative areas in each category. Here the error, 
purposeful or not, is to vary the width of the categories. This of course makes 
comparison to the other categories impossible. It does embellish the importance 
of the category with the expanded width because it has a greater area, 
inappropriately, and thus visually "says" that that category has a higher 
probability of occurrence. 


Changing the units of measurement of the axis can smooth out a drop or
accentuate one. If you want to show large changes, then measure the variable in
small units, pennies rather than thousands of dollars. And of course to continue
the fraud, be sure that the axis does not begin at zero, zero. If it begins at
zero, zero, then it becomes apparent that the axis has been manipulated.


Again, the goal of descriptive statistics is to convey meaningful visuals that tell
the story of the data. Purposeful manipulation is fraud and unethical at the worst,
but even at its best, making these types of errors will lead to confusion in
the analysis.


Chapter Review


A bar graph is a chart that uses either horizontal or vertical bars to show
comparisons among categories. One axis of the chart shows the specific
categories being compared, and the other axis represents a discrete value. Some
bar graphs present bars clustered in groups of more than one (grouped bar
graphs), and others show the bars divided into subparts to show cumulative effect
(stacked bar graphs). Bar graphs are especially useful when categorical data is
being used, but they can also be used for quantitative discrete data.


A histogram is a graphic version of a frequency distribution. The graph consists 
of bars of equal width drawn adjacent to each other. The horizontal scale 
represents classes of quantitative data values and the vertical scale represents 
frequencies. The heights of the bars correspond to frequency values. Histograms 
are typically used for large, continuous, quantitative data sets. 


Exercise: 


Problem: 


The students in Ms. Ramirez’s math class have birthdays in each of the four 
seasons. [link] shows the four seasons, the number of students who have 
birthdays in each season, and the percentage (%) of students in each group. 
Construct a bar graph showing the percentage of students in each group. 


Seasons    Number of students    Proportion of population
Spring     8                     24%
Summer     9                     26%
Autumn     11                    32%
Winter     6                     18%


Solution:


[Figure: bar graph titled "Birthdays in each season", with the seasons (Spring,
Summer, Autumn, Winter) on the x-axis and Proportion (%) on the y-axis, scaled
from 0% to 35%.]


Exercise:


Problem:


David County has six high schools. Each school sent students to participate
in a county-wide science competition. [link] shows the percentage
breakdown of competitors from each school, and the percentage of the
entire student population of the county that goes to each school. Construct a
bar graph that shows the county-wide population percentage of students at
each school.


High School    Science competition population    Overall student population
Alabaster      28.9%                             8.6%
Concordia      7.6%                              23.2%
Genoa          12.1%                             15.0%
Mocksville     18.5%                             14.3%
Tynneson       24.2%                             10.1%
West End       8.7%                              28.8%


Solution:


[Figure: bar graph titled "Students in science competition from each school",
with the six schools (Alabaster, Concordia, Genoa, Mocksville, Tynneson, West
End) on the x-axis and a percentage scale from 0.0% to 35.0% on the y-axis.]


Exercise: 


Problem: Construct a histogram for the following:


a. Pulse Rates for Women    Frequency
   60-69                    12
   70-79                    14
   80-89                    11
   90-99                    1
   100-109                  1
   110-119                  0
   120-129                  1

b. Actual Speed in a 30 MPH Zone    Frequency
   42-45                            25
   46-49                            14
   50-53                            7
   54-57                            3
   58-61                            1

c. Tar (mg) in Nonfiltered Cigarettes    Frequency
   10-13                                 1
   14-17                                 0
   18-21                                 15
   22-25                                 7
   26-29                                 2


Homework 


Use the following information to answer the next two exercises: Suppose one 
hundred eleven people who shopped in a special t-shirt store were asked the 
number of t-shirts they own costing more than $19 each. 


[Figure: histogram with Number of T-shirts costing more than $19 each (1
through 7) on the x-axis and Relative frequency (0 to 40/111) on the y-axis.]


Exercise: 
Problem: 


The percentage of people who own at most three t-shirts costing more than 
$19 each is approximately: 


a. 21 
b. 59 
c. 41 
d. Cannot be determined 


Solution: 


c


Exercise: 


Problem: 


If the data were collected by asking the first 111 people who entered the 
store, then the type of sampling is: 


a. cluster 

b. simple random 
c. stratified 

d. convenience 


Solution: 


d 


Glossary 


Frequency 
the number of times a value of the data occurs 


Histogram 
a graphical representation in x-y form of the distribution of data in a data 
set; x represents the data and y represents the frequency, or relative 
frequency. The graph consists of contiguous rectangles. 


Relative Frequency 
the ratio of the number of times a value of the data occurs in the set of all 
outcomes to the number of all outcomes 


Descriptive Statistics - Numerical Summaries of Data - MRU - C Lemieux 


By the end of this section, we want to be able to describe the distribution of 
quantitative data (i.e. shape, centre and variation). In the previous section, we 
looked at the shape of quantitative data. This section focuses on numerical 
summaries of data for quantitative data. In particular, it focuses on measures of 
centre and measures of variation. 


There are other numerical summaries of data called measures of location, 
which will be discussed in the next section. 


Measures of centre 


Measures of centre or average give us a sense of what a typical value in a data 
set is. For example, the average number of children in a family in Canada is 
1.9. This means that a typical family will have about 1.9 children. Obviously, 
no family has exactly 1.9 children, but this gives a sense of how many children 
families have on average. Further, some families may have 8 children. Others 
may have no children. The measure of centre gives a sense of what is going on 
in the middle of the data set. 


Note: Even though you may wish to round an average to a whole number
(especially when it is about the number of people), this is neither necessary
nor appropriate, because the average gives a sense of the centre of the data,
which is not necessarily an actual data value.


The "center" of a data set is a way of describing a typical value in a data set. 
The three most widely used measures of the "center" of the data are the mean, 
median and mode. 


To explain these three measures of centre, let’s look at an example. Suppose we
want to find the average weight of 50 people. To calculate the mean weight of
the 50 people, we would add the 50 weights together and divide by 50. To find
the median weight of the 50 people, order the data from least heavy to most
heavy, and find the weight that splits the data into two equal parts. The mode is
the most commonly occurring value. To find the mode, find the weight that
occurs the most frequently.


This section provides more details on how to find the measures of centre, the
notation for the measures, and when it is best to use which measure.
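As a quick preview, all three measures can be computed with Python's standard `statistics` module. The five weights below are hypothetical values made up for illustration; they are not data from this text.

```python
import statistics

# Hypothetical weights (in pounds) of five people, for illustration only.
weights = [150, 160, 160, 175, 180]

mean_weight = statistics.mean(weights)      # add the weights and divide by 5
median_weight = statistics.median(weights)  # middle value of the ordered data
mode_weight = statistics.mode(weights)      # most frequently occurring value

print(mean_weight, median_weight, mode_weight)
```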


Note: 

NOTE 

Though the words “mean” and “average” are sometimes used interchangeably, 
they do not necessarily mean the same thing. In general, “average” is any 
measure of centre and “mean” is a specific type of centre. Many people use 
average and mean as the same, but not always. For example, when people talk 
about average housing price, they are usually referring to the median house 
price. 


Mean 


The mean of a data set can be thought of as a balancing point (or fulcrum). If 
you think of numbers as weighted, then the mean is the number that will 
balance the data values evenly. Suppose your data values are 1, 2, 3, 4, 5. Then 
the number that balances the data is 3. To go a little deeper, the balance point is 
three because the distance between 3 and the data values less than it is equal to 
the distance between 3 and the data values greater than it as shown in [link]. 


To find the mean of this data, we need to find the number that balances the 
data equally on both sides. 


Let's try a harder example. Suppose our data values are 0, 1, 1, 2, 3, 3, 4, 6. The
mean will be the number such that the total distance to the data values below it
and the total distance to the data values above it are the same. Let's see whether
3 is the mean again. Then the distance between our suggested "mean" and 0 is 3;
the distance between our "mean" and 1 is 2 (but there are two of them); and the
distance between our "mean" and 2 is 1. That is, the total distance between our
"mean" and all of the data values below it is 3 + 2 + 2 + 1 = 8. If 3 is actually
our mean, then the total distance between 3 and the data values above it will also
be 8. Let's check. The distance between our "mean" and 4 is 1; the distance
between our "mean" and 6 is 3. The total distance above 3 is only 4. Therefore,
3 cannot be our mean as it doesn't balance our data.


Note: The two data values of 3 were ignored as their distance from the
suggested mean is 0. Therefore, they would not change the answer if included.


From our calculations above, the choice of 3 was too big, as the lower side was
too heavy. Let's try 2.5 as our mean. If the mean is 2.5, then the distance between
our "mean" and 0 is 2.5; the distance between our "mean" and 1 is 1.5 (but
there are two of them); the distance between our "mean" and 2 is 0.5. Thus the
total distance between our mean of 2.5 and the data values below it is 2.5 + 1.5
+ 1.5 + 0.5 = 6. If 2.5 is our mean, then the total distance above 2.5 should also
be 6. The distance between our "mean" and 3 is 0.5 (but there are two of them);
the distance between our "mean" and 4 is 1.5; the distance between our "mean"
and 6 is 3.5. Thus the total distance between the data values above and our
suggested mean of 2.5 is 0.5 + 0.5 + 1.5 + 3.5 = 6! Therefore, 2.5 is the mean
for this data.


To find the mean of this data, we need to find the number that balances the 
data equally on both sides. Notice that the mean here is not a data value. 
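
The balance-point check for the data 0, 1, 1, 2, 3, 3, 4, 6 can be repeated in a few lines of Python. This is only a sketch; the helper name `balance_gap` is made up for this illustration and is not part of any library.

```python
def balance_gap(data, candidate):
    """Return (total distance below, total distance above) for a candidate centre."""
    below = sum(candidate - x for x in data if x < candidate)
    above = sum(x - candidate for x in data if x > candidate)
    return below, above

data = [0, 1, 1, 2, 3, 3, 4, 6]

print(balance_gap(data, 3))    # (8, 4): the lower side is heavier, so 3 is too big
print(balance_gap(data, 2.5))  # (6.0, 6.0): both sides balance, so 2.5 is the mean
print(sum(data) / len(data))   # 2.5
```

The last line shows that adding the values and dividing by how many there are gives the same balance point without any guessing.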


Thankfully we don't have to do these in-depth calculations and guesses each
time. Instead, the formula is pretty straightforward.


The Greek letter μ (pronounced "mew") represents the population mean. That
is, it is the mean for the population data.
Equation:

$$\mu = \frac{\sum x}{N}$$

Formula for Population Mean


The letter used to represent the sample mean is an x with a bar over it
(pronounced "x bar"): x̄. It is the mean of a sample of data from the population.


The sample mean is an estimate of the population mean. One of the 
requirements for the sample mean to be a good estimate of the population 
mean is for the sample taken to be truly random. 
Equation:

$$\bar{x} = \frac{\sum x}{n}$$

Formula for Sample Mean


To see how the formula works, consider the sample:
1; 1; 1; 2; 2; 3; 4; 4; 4; 4; 4
Equation:

$$\bar{x} = \frac{1+1+1+2+2+3+4+4+4+4+4}{11} = \frac{30}{11} \approx 2.7$$

Note: Since it is sample data, we use the symbol x̄.


Note: 

Application of the law of large numbers 

If the size of a random sample is increased, then the sample mean will more 
likely be a better estimate of the population mean. 

Note: Just because the sample size increases does not mean that the sample 
mean for the larger sample must be a better estimate. It is only that it is more 
likely to be a better estimate. 
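
A small simulation can illustrate this idea. The sketch below assumes a fair six-sided die as the population (population mean 3.5), which is an example chosen here and not taken from the text. The seed is arbitrary and only makes the run repeatable.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable (arbitrary choice)

def sample_mean_of_die_rolls(n):
    """Mean of n simulated rolls of a fair six-sided die."""
    return statistics.mean(rng.randint(1, 6) for _ in range(n))

population_mean = 3.5  # mean of the values 1 through 6
for n in (10, 100, 1000, 10000):
    m = sample_mean_of_die_rolls(n)
    print(n, round(m, 3), "error:", round(abs(m - population_mean), 3))
```

The error tends to shrink as n grows, but any single run can show a larger sample doing slightly worse than a smaller one, which is exactly the caution in the note above.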


Median 


On a road, the median is in the middle of the road. In statistics, the median is 
the middle data value (when the data is in order). 


You can quickly find the location or position of the median by using the
expression (n + 1)/2.


The letter n is the total number of data values in the sample. If n is an odd
number, the median is the middle value of the ordered data (ordered smallest to
largest). If n is an even number, the median is equal to the two middle values
added together and divided by two after the data has been ordered. For
example, if the total number of data values is 97, then (n + 1)/2 = (97 + 1)/2 = 49. The
median is the 49th value in the ordered data. If the total number of data values is
100, then (n + 1)/2 = (100 + 1)/2 = 50.5. The median occurs midway between the 50th
and 51st values. The location of the median and the value of the median are not
the same. The upper case letter M is often used to represent the median. The
next example illustrates the location of the median and the value of the median.
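
The difference between the location and the value of the median can be sketched in Python as follows. The function names are illustrative only.

```python
def median_location(n):
    """Position of the median in the ordered data: (n + 1) / 2."""
    return (n + 1) / 2

def median_value(data):
    """Value of the median: the middle value, or the average of the two middle values."""
    ordered = sorted(data)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median_location(97))   # 49.0
print(median_location(100))  # 50.5
print(median_value([0, 1, 1, 2, 3, 3, 4, 6]))  # 2.5
```

A location of 50.5 is not itself a data value; it signals that the median is the average of the 50th and 51st ordered values.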


Mode 


Another measure of the center is the mode. The mode is the data value that
occurs most frequently and at least twice.


A data set can have


• no mode,

• one mode (unimodal),

• two modes (bimodal),

• or many modes (multimodal).


Consider the statistics exam scores for 20 students: 


50; 53; 59; 59; 63; 63; 72; 72; 72; 72; 72; 76; 78; 81; 83; 84; 84; 84; 90; 93
The most frequent score is 72, which occurs five times. Mode = 72. 


Note:

The mode can be calculated for qualitative data as well as for quantitative data. 
For example, if the data set is: red, red, red, green, green, yellow, purple, black, 
blue, the mode is red. 
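
The definition of the mode (most frequent, and occurring at least twice) can be sketched with a counter. The helper `modes` is a made-up name for this illustration. It returns a list so that bimodal and multimodal data are handled too.

```python
from collections import Counter

def modes(data):
    """Return every value with the highest count, or [] if nothing occurs at least twice."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest < 2:
        return []
    return [value for value, count in counts.items() if count == highest]

scores = [50, 53, 59, 59, 63, 63, 72, 72, 72, 72, 72, 76, 78, 81, 83, 84, 84, 84, 90, 93]
colours = ["red", "red", "red", "green", "green", "yellow", "purple", "black", "blue"]

print(modes(scores))     # [72]
print(modes(colours))    # ['red']
print(modes([1, 2, 3]))  # []
```

The same function works for the qualitative colour data because it only counts values and never does arithmetic on them.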


Example: 
Exercise: 


Problem: 


AIDS data indicating the number of months a patient with AIDS lives
after taking a new antibody drug are as follows (smallest to largest):

3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; ...; 35; 37; 40; 44; 44; 47

Calculate the mean, median and mode. 


Solution: 


The calculation for the mean is:

$$\bar{x} = \frac{3+4+(8)(2)+10+11+12+13+14+(15)(2)+(16)(2)+\dots+35+37+40+(44)(2)+47}{40} = 23.6$$


To find the median, M, first use the formula for the location. The location
is:

$$\frac{n+1}{2} = \frac{40+1}{2} = 20.5$$

Starting at the smallest value, the median is located between the 20th and
21st values (the two 24s):

$$M = \frac{24+24}{2} = 24$$

To find the mode, we first have to determine if any data values repeat. If
no data values repeat, there is no mode. Since 8 appears twice, we know
there is a mode. Next we need to check if any data value appears more
than twice. If one did, the most frequent value would be the mode. Since
no data value appears more than twice, every data value that appears
twice is a mode.


Therefore, the modes are 8, 15, 16, 17, 22, 24, 26, 27, 29, 34, 44. This 
data set is multi-modal. 


Example: 
Exercise: 


Problem: 


Suppose that in a small town of 50 people, one person earns $5,000,000 
per year and the other 49 each earn $30,000. Which is the better measure 
of the "center": the mean, the median or the mode? 


Solution: 


$$\bar{x} = \frac{5{,}000{,}000 + 49(30{,}000)}{50} = 129{,}400$$


M = 30,000 


(There are 49 people who earn $30,000 and one person who earns 
$5,000,000.) 


The mode is 30,000 as this data value occurs 49 times. 


Since the median and mode are equal, let's focus on the median. The
median is a better measure of the "center" than the mean because 49 of
the values are 30,000 and one is 5,000,000. The 5,000,000 is an outlier.
The 30,000 gives us a better sense of the middle of the data.
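
The three measures for the small town can be checked with Python's standard `statistics` module. This is a quick sketch of the calculation, not a required method.

```python
import statistics

# 49 people earning $30,000 and one person earning $5,000,000
incomes = [30_000] * 49 + [5_000_000]

print(statistics.mean(incomes))    # 129400
print(statistics.median(incomes))  # 30000.0
print(statistics.mode(incomes))    # 30000
```

Only the mean moves when the single large income is changed; the median and mode stay at 30,000.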


Note: The above example highlights two important ideas:


• Outliers: We have defined outliers as data values that are significantly
different from other data values, but we have not provided a way of
finding them. This will be discussed in the next section. Regardless, we
can see that 5 million is significantly different from 30 thousand in the
above example.


• Skew: When a data set has outliers, the outliers have the potential to skew
the mean. In the above example, the centre of the data is 30,000, but the
mean is 129,400. Thus the outlier of 5 million is pulling the mean up.
That is, it is skewing the centre value by pulling it to the right on the
number line.


Comparing measures of centre 


Above we have described how to find each of the measures of centre. But how
do you choose which measure of centre to use in which situation? One option is
to provide all three measures of centre, but sometimes this can be
overwhelming to the audience. Instead, you want to pick the one that best
describes the data. The following are some general guidelines for choosing the
best measure of centre.


The mean is often the best measure of centre to use because it is the most well- 
known and familiar of the measures of centre. It is also the only measure of 
centre that is computed using all of the sample values. But the mean is 
susceptible to outliers. As was seen in [link], if there is an outlier, the mean can 
be pulled in one direction away from the centre. 


An outlier is any data value that is significantly different from the other data
values. In [link], the outlier is 5 million as it is significantly higher than the
other data values. We will discuss how to find outliers in section 2.3
(Boxplots).


If there is an outlier in the data set that is skewing the mean, the best measure 
of centre to use is the median as it is not susceptible to outliers. 


But be careful: The presence of outliers does not necessarily mean that the 
median is the best measure of centre. Here are a couple of examples where this 
is the case: 


1. Suppose there are 200 data values in a sample and one data value is an
outlier. Then the mean will most likely not be affected much by the outlier.

2. Suppose there is a data set that has outliers, but one is a high outlier and
one is a low outlier. Then the outliers may balance out and not affect the
mean.


The mode is best used for categorical data, but can sometimes be used for 
quantitative data. For example, in [link], the mode would be a good measure of 
centre because the majority of data values are the same. 


In [link], since there are no outliers, the mean is the best measure of centre to
use. In [link], since there is an outlier (5 million) and the mean and median are
quite different, the median is the best measure of centre to use.


The following tables compare the measures of centre. 


Measure     How Common
Mean        most familiar
Median      commonly used
Mode        sometimes used


Measure     Every Score Used
Mean        yes
Median      no
Mode        no


Measure     Affected by Outliers
Mean        yes
Median      no
Mode        no


How to mislead with averages 


Consider the following situation: "As you arrive at an open house in your
preferred new home location, a neighbour comes up to you while he is walking
his dog. “This is a great neighbourhood to live in! The average income in this
neighbourhood is $60,000,” he tells you. You are pleased to hear how affluent
the community is. A year after you’ve moved into your new home, the same
neighbour comes to your door and asks you to sign a petition. “The city is
overvaluing the homes in our neighbourhood again, which means more taxes.
The average income in this neighbourhood is $20,000. We can’t afford these
increases.” You dutifully sign the petition because you don’t want to pay more
taxes, but you’re also confused. Wasn’t the average income a lot higher last
year? What happened? Is your neighbour a liar?" In this example, there are
many different possible scenarios that could explain the discrepancy. But no
matter what the scenario is, the neighbour is picking his statistics to fit his
situation.


One scenario: The neighbour may be picking and choosing which measure of 
centre to use. Suppose that most people in the neighbourhood make around 
$20,000 a year, but there are a few people who live on the street with the super 
nice view who make $300,000 a year. Then in the first case, when he says the 
average income is $60,000, he has used the mean which has been pulled higher 
by the outliers of $300,000. He chose to use the mean to make the 
neighbourhood look more affluent than it really is. 


But when he wanted to make the argument that the neighbourhood wasn’t as
affluent and should be in a lower tax bracket, he changed which measure of
centre to use. Instead, he may have used the median or mode because they
aren’t influenced by the outliers.


Another scenario: The neighbour may be choosing how he defines income to
help make his point. In the first case, he may have only used those who are
employed to come up with the average salary. In the second case, he may
have used all adults in the neighbourhood, including students living with their
parents, stay-at-home parents, retired people or people out of work. Their
incomes may be very low or non-existent, which would pull the average
lower. In this scenario, he may be using the same measure of centre, but
is picking what he means by income to get the results he wants.


There are other possible scenarios. Can you think of any? 


Skew 


As has been noted above, if there are outliers in a data set, this can cause the 
mean to be pulled up or down (i.e. be either higher than expected or lower than 
expected) by these outliers. Outliers don't have to be present for this to happen. 
Essentially, any time that there are data values that cause the mean and median 
to be significantly different, then we say the data is skewed. 


• If the mean is significantly larger than the median and the histogram has a
long tail on the right, then the data is right skewed or positively skewed.

• If the mean is significantly smaller than the median and the histogram has
a long tail on the left, then the data is left skewed or negatively skewed.

• If the mean and the median are approximately the same and the histogram
has balanced tails, then the data is symmetric.
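
The mean-versus-median part of these rules can be sketched in Python. The text does not define "significantly", so the `tolerance` parameter below is a hypothetical cut-off introduced only for this illustration, and the sketch does not look at the histogram at all.

```python
import statistics

def describe_skew(data, tolerance=0.1):
    """Label data by comparing mean and median.

    tolerance is a hypothetical cut-off: the two measures count as
    "approximately the same" when they differ by less than this fraction
    of the standard deviation.
    """
    mean = statistics.mean(data)
    median = statistics.median(data)
    spread = statistics.pstdev(data)
    if spread == 0 or abs(mean - median) < tolerance * spread:
        return "symmetric"
    return "right skewed" if mean > median else "left skewed"

print(describe_skew([30_000] * 49 + [5_000_000]))  # right skewed
print(describe_skew([1, 2, 3, 4, 5]))              # symmetric
```

A different cut-off would label borderline data sets differently, so the histogram should still be checked before settling on a description.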


Examples of skewness and symmetry 


These are "perfect" examples of skewness and symmetry. In reality, there
may be multiple modes or the mean and median will be similar but not
equal. These are provided to give an example.


Measures of variation 


An important characteristic of any set of data is the variation in the data. In 
some data sets, the data values are concentrated closely near the mean; in other 
data sets, the data values are more widely spread out from the mean. There are 
five measures of variation: range, standard deviation, variance, interquartile 
range and coefficient of variation. 


The range is the easiest to calculate. It is found by subtracting the minimum
value in the data set from the maximum value in the data set (range = maximum
− minimum). Though the range is easy to calculate, it is very much affected by
outliers.


The interquartile range will be discussed in the section on box plots (section 
2.3). 


The most common measure of variation, or spread, is the standard deviation. 
The standard deviation measures how far data values are from their mean, on 
average. 


Note: 

Variation within a sample vs. variation between samples 

When talking about variation or variability in statistics, there are two different
kinds: variation within a sample and variation between samples.

When we discuss finding the standard deviation, range or any measure of 
variation of a sample, we are discussing variation within a sample. In this case, 
we are looking at how the data values vary from each other. Most of the time, 
when we talk about variation this is what we are talking about. 

We can also talk about how much different samples vary from each other. For
example, we could take multiple samples and find the sample mean of each
sample. If we talk about how much the means vary from each other, we are
discussing variation between samples. We will discuss this specific type of
variation in Chapter 6.

The law of large numbers says that, for random samples, as the sample size
increases, the sample will more closely resemble the population. For
example, as the sample size increases, the sample standard deviation will
approach the population standard deviation. Thus, the variation within the
sample will more closely mimic the variation within the population as the
sample size increases. But as the sample size increases, the sample means will
also approach the population mean. Thus, there will be less variation between the
sample means. This means that the variation between samples decreases as
the sample size increases. When we discuss sampling variability, we are
discussing variation between samples.

For this chapter, we are focusing on variation within a sample. 


The standard deviation (and variance) 


• provides a numerical measure of the overall amount of variation in a data
set, and

• can be used to determine whether a particular data value is close to or far
from the mean.


The standard deviation provides a measure of the overall variation in a data set 


The standard deviation is small when the data are all concentrated close to the 
mean, exhibiting little variation or spread. The standard deviation is larger 
when the data values are more spread out from the mean, exhibiting more 
variation. 


Suppose that we are studying the amount of time customers wait in line at the 
checkout at supermarket A and supermarket B. It is known that the average 
wait time at both supermarkets is about five minutes. At supermarket A, 
though, the standard deviation for the wait time is two minutes; at supermarket 
B the standard deviation for the wait time is four minutes. 


Because supermarket B has a higher standard deviation, we know that there is
more variation in the wait times at supermarket B. Overall, wait times at
supermarket B are more spread out from the average; wait times at supermarket
A are more concentrated near the average. This means that at supermarket B,
you have a greater chance of having a short wait time, but also a greater chance
of having a long wait time, compared to supermarket A. That is, the wait
times are more volatile at supermarket B. On the other hand, you will be
waiting about the same amount of time on each visit to supermarket A. That is,
wait times are more consistent at supermarket A.


One way we could summarize the supermarket situation is as follows:


• A typical wait time at supermarket A is 5 minutes, give or take 2 minutes.
This means that someone typically has to wait 3 to 7 minutes in the
checkout line.

• A typical wait time at supermarket B is 5 minutes, give or take 4 minutes.
This means that someone typically has to wait 1 to 9 minutes in the
checkout line.


Here the term “typical” means common, normal. So normally people will wait
between 3 and 7 minutes at supermarket A, but there will be some people who
only wait 2 minutes and some who wait 10 minutes at the checkout. That is, the
typical range only provides a sense of what is going on in the middle of the
data, but there are values occurring outside of that range.


Note: For the typical value, you can use any measure of centre. But for the give
or take value, you have to use standard deviation. No other measure of
variation works.


Calculating the Standard Deviation 


Note: The following explains how to calculate the standard deviation by hand.
We will be using computer software to do this. Thus it is not important to
know this section in detail, but it is helpful to know the basics of how the
standard deviation is calculated to help understand what the standard deviation
is.


If x is a number, then the difference "x − mean" is called its deviation. In a data
set, there are as many deviations as there are items in the data set. The
deviations are used to calculate the standard deviation. If the numbers belong to
a population, in symbols a deviation is x − μ. For sample data, in symbols a
deviation is x − x̄.


The procedure to calculate the standard deviation depends on whether the
numbers are the entire population or are data from a sample. The calculations
are similar, but not identical. Therefore the symbol used to represent the
standard deviation depends on whether it is calculated from a population or a
sample. The lower case letter s represents the sample standard deviation and the
Greek letter σ (sigma, lower case) represents the population standard deviation.
If the sample has the same characteristics as the population, then s should be a
good estimate of σ.


To calculate the standard deviation, we need to calculate the variance first. The
variance is the average of the squares of the deviations (the x − x̄ values for
a sample, or the x − μ values for a population). The symbol σ² represents the
population variance; the population standard deviation σ is the square root of
the population variance. The symbol s² represents the sample variance; the
sample standard deviation s is the square root of the sample variance. You can
think of the standard deviation as a special average of the deviations.


If the numbers come from a census of the entire population and not a sample, 
when we calculate the average of the squared deviations to find the variance, 
we divide by N, the number of items in the population. If the data are from a 
sample rather than a population, when we calculate the average of the squared 
deviations, we divide by n — 1, one less than the number of items in the sample. 


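Formulas for the Sample Standard Deviation


$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \quad \text{or} \quad s = \sqrt{\frac{\sum f(x - \bar{x})^2}{n - 1}}$$

where f is the frequency of each distinct data value.

• For the sample standard deviation, the denominator is n − 1, that is, the
sample size − 1.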


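Formulas for the Population Standard Deviation


$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} \quad \text{or} \quad \sigma = \sqrt{\frac{\sum f(x - \mu)^2}{N}}$$

• For the population standard deviation, the denominator is N, the number of
items in the population.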
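
The only difference between the two calculations is the denominator, which the following Python sketch makes explicit. The function names `sample_std` and `population_std` are chosen for this illustration; `statistics.stdev` and `statistics.pstdev` are the standard-library equivalents and are printed alongside for comparison.

```python
import math
import statistics

def sample_std(data):
    """Sample standard deviation: divide the squared deviations by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

def population_std(data):
    """Population standard deviation: divide the squared deviations by N."""
    N = len(data)
    mean = sum(data) / N
    return math.sqrt(sum((x - mean) ** 2 for x in data) / N)

data = [0, 1, 1, 2, 3, 3, 4, 6]  # data set from the balance-point example
print(sample_std(data), statistics.stdev(data))
print(population_std(data), statistics.pstdev(data))
```

For the same data, dividing by n − 1 always gives a slightly larger result than dividing by N.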


Since the standard deviation is found by taking a square root, the standard
deviation is always positive or zero.


Since the variance is the square of the standard deviation, it is not helpful as a
descriptive statistic. For example, if you are looking at the weights of
basketballs in kg, then the standard deviation will be in kg, while the variance
will be in kg². Thus the variance is meaningless when trying to interpret the
variation in data. It is helpful later on in statistics, but at this point it is not.


Example: 

In a fifth grade class, the teacher was interested in the average age and the
sample standard deviation of the ages of her students. The following data are
the ages for a SAMPLE of n = 20 fifth grade students. The ages are rounded to
the nearest half year:

9; 9.5; 9.5; 10; 10; 10; 10; 10.5; 10.5; 10.5; 10.5; 11; 11; 11; 11; 11; 11; 11.5; 11.5; 11.5

Equation:

$$\bar{x} = \frac{9 + 9.5(2) + 10(4) + 10.5(4) + 11(6) + 11.5(3)}{20} = 10.525$$

The average age is 10.53 years, rounded to two places.
The variance may be calculated by using a table. Then the standard deviation 
is calculated by taking the square root of the variance. We will explain the 
parts of the table after calculating s. 


Data   Freq.   Deviations               Deviations²              (Freq.)(Deviations²)
x      f       (x − x̄)                  (x − x̄)²                 f(x − x̄)²
9      1       9 − 10.525 = −1.525      (−1.525)² = 2.325625     1 × 2.325625 = 2.325625
9.5    2       9.5 − 10.525 = −1.025    (−1.025)² = 1.050625     2 × 1.050625 = 2.101250
10     4       10 − 10.525 = −0.525     (−0.525)² = 0.275625     4 × 0.275625 = 1.1025
10.5   4       10.5 − 10.525 = −0.025   (−0.025)² = 0.000625     4 × 0.000625 = 0.0025
11     6       11 − 10.525 = 0.475      (0.475)² = 0.225625      6 × 0.225625 = 1.35375
11.5   3       11.5 − 10.525 = 0.975    (0.975)² = 0.950625      3 × 0.950625 = 2.851875


The total is 9.7375


The sample variance, s², is equal to the sum of the last column (9.7375)
divided by the total number of data values minus one (20 − 1):

$$s^2 = \frac{9.7375}{20 - 1} = 0.5125$$

The sample standard deviation s is equal to the square root of the sample
variance:

$$s = \sqrt{0.5125} = 0.715891$$

Rounded to two decimal places, s = 0.72.
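
The table calculation can be reproduced in Python from the frequencies alone. This is a sketch of the by-hand procedure (ages 9, 9.5, 10, 10.5, 11 and 11.5 with frequencies 1, 2, 4, 4, 6 and 3), not the software output the course relies on. The variable names are chosen for this illustration.

```python
import math

# Frequency table from the example: age -> number of students
frequencies = {9: 1, 9.5: 2, 10: 4, 10.5: 4, 11: 6, 11.5: 3}

n = sum(frequencies.values())
mean = sum(x * f for x, f in frequencies.items()) / n
sum_sq = sum(f * (x - mean) ** 2 for x, f in frequencies.items())
variance = sum_sq / (n - 1)   # sample variance: divide by n - 1
std_dev = math.sqrt(variance)

print(n)                   # 20
print(round(mean, 3))      # 10.525
print(round(sum_sq, 4))    # 9.7375
print(round(variance, 4))  # 0.5125
print(round(std_dev, 2))   # 0.72
```

Each line of the code corresponds to one column of the table: the deviation, its square, and the square multiplied by the frequency.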


Explanation of the standard deviation calculation shown in the table 


The deviations show how spread out the data are about the mean. The data
value 11.5 is farther from the mean than is the data value 11, which is indicated
by the deviations 0.975 and 0.475. A positive deviation occurs when the data
value is greater than the mean, whereas a negative deviation occurs when the
data value is less than the mean. The deviation is −1.525 for the data value nine.
If you add the deviations, the sum is always zero. (For [link], there are n = 20
deviations.) So you cannot simply add the deviations to get the spread of the
data. By squaring the deviations, you make them positive numbers, and the sum
will also be positive. The variance, then, is the average squared deviation.


The variance is a squared measure and does not have the same units as the data. 
Taking the square root solves the problem. The standard deviation measures the 
spread in the same units as the data. 


Notice that instead of dividing by n = 20, the calculation divided by n − 1 = 20
− 1 = 19 because the data is a sample. For the sample variance, we divide by
the sample size minus one (n − 1). Why not divide by n? The answer has to do
with the population variance. The sample variance is an estimate of the
population variance. Based on the theoretical mathematics that lies behind
these calculations, dividing by (n − 1) gives a better estimate of the population
variance.


The standard deviation, s or σ, is either zero or larger than zero. When the
standard deviation is zero, there is no spread; that is, all the data values are
equal to each other. The standard deviation is small when the data are all
concentrated close to the mean, and is larger when the data values show more
variation from the mean. When the standard deviation is a lot larger than zero,
the data values are very spread out about the mean; outliers can make s or σ
very large.


Coefficient of variation 


The standard deviation is a very good measure of variation, but it is not always
the best choice when comparing two data sets, in particular when the means of
the two data sets are different. Suppose you are comparing the yearly salaries
(excluding bonuses) of junior employees versus CEOs at oil and gas companies
around Alberta. The yearly salaries for the junior employees will be
significantly smaller than those of the CEOs. Let’s say the average salary for junior
employees is $45,000 while for CEOs it is $500,000. Now suppose that the
standard deviation for both groups is $50,000. If we only looked at the standard
deviation, we might say that the variation in both groups is the same. But a
variation of $50,000 when the average salary is $45,000 is quite a bit more than
when the average salary is $500,000. That is, there is more relative variation in the junior
employees’ salaries. The standard deviation doesn’t capture this difference. But
the coefficient of variation does, as it is a measure of relative variation. That is,
it takes into account that bigger data values might have a larger standard
deviation, but that doesn’t mean they have larger variation.


The coefficient of variation is found by expressing the standard deviation as a
percentage of the mean:
Equation:

$$\text{Coefficient of Variation} = \frac{s}{\bar{x}}(100\%)$$


In the above example, the coefficient of variation would be:
Equation:

$$\text{CofV for junior employees} = \frac{50{,}000}{45{,}000}(100\%) = 111.1\%$$

Equation:

$$\text{CofV for CEOs} = \frac{50{,}000}{500{,}000}(100\%) = 10\%$$

The larger the coefficient of variation, the larger the relative variation. Thus, as
a measure of relative variation, the junior employees have significantly more
relative variation (111.1%) compared to the CEOs (10%).
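
The salary comparison can be checked with a one-line function. The name `coefficient_of_variation` is chosen for this sketch. Evaluating the formula directly gives 111.1% for the junior employees and 10% for the CEOs.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100

print(round(coefficient_of_variation(50_000, 45_000), 1))   # 111.1
print(round(coefficient_of_variation(50_000, 500_000), 1))  # 10.0
```

Because the result is a ratio, expressing both salaries in thousands of dollars instead of dollars would give the same percentages.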


Here are some points about the coefficient of variation: 


• The coefficient of variation is not affected by multiplicative changes of
scale.

• The coefficient of variation is used to compare variation between data
sets. This is very important to remember. For multiple data sets, if the
means are the same, you can compare the standard deviations. BUT if the
means are different, you MUST use the coefficient of variation to compare
the variation in the data sets.

• If the standard deviation is larger than the mean, the coefficient of
variation is bigger than 100%.


Measure                    When to use

Range                      The range is rarely the best measure of variation to use.
                           But it is a good quick calculation of variation.

Standard deviation         Similar to the mean, this is the most common measure
                           of variation. Also, it is derived from the mean.
                           Therefore, if your best measure of centre is the mean,
                           then the standard deviation is a good complement to it.
                           Further, it is best used when finding the variation for
                           one data set.

Variance                   As it is the square of the standard deviation, it is NEVER
                           the best measure of variation to use. It is helpful in later
                           topics in statistics though.

Interquartile range        This is a not very well known measure of variation, but
                           it is helpful in describing the range for the middle 50% of
                           the data values. Further, it is based on measures of
                           location. Therefore, if your best measure of centre is the
                           median, then the IQR is a good complementary measure
                           of variation.

Coefficient of variation   This is not well known, but it is useful for giving a
                           context-free interpretation of variation. It is the best
                           measure to use when comparing the variations of two or
                           more data sets that have different measures of centre.


When to use which measure of variation


Example: 

Suppose you are looking at two companies and each company has 24 
employees. At one company, everybody except the CEO makes $30,000. The 
CEO makes $490,000. Thus, the data values would be 

$30,000; $30,000; $30,000; $30,000; $30,000; ...; $490,000

The second company has an interesting policy. Everybody who starts at the 
company makes $30,000 a year, but as soon as someone else gets hired, they 
get paid $20,000 more. They only hire one person at a time. So, the first 
person who was hired started at $30,000, then when a second person got hired, 
the first person’s salary was raised to $50,000. When a third person got hired, 
the first person’s salary was raised to $70,000 while the salary of the second 
person hired was raised to $50,000. This has been done 23 times. Therefore, 
their data values (i.e. salaries) would look like this: 

$30,000; $50,000; $70,000; $90,000; $110,000; ...; $490,000

Without doing any calculations, we can see that company one has fairly
consistent salaries except for the CEO, while company two has salaries that
are more spread out.

The following table provides the count (i.e. sample size), mean, and the 
measures of variation for the two companies. 


                                Company One    Company Two
Count                           24             24
Mean                            49,166.67      260,000.00
Range                           460,000        460,000
Population standard deviation   91,920.10      138,443.73
Coefficient of Variation        190.98%        54.39%


In the table above, notice that the range is the same for the two data sets. If we 
only looked at the range, this would give a false sense that the amount of 
variation in the two data sets is the same, but we know it isn’t. 

The standard deviation is measuring how much, on average, the data values 
vary from the mean. For company one, 23 of the 24 data values deviate the 
same amount from the mean ($49,166.67 — $30,000 = $19,166.67) with only 
the $490,000 deviating a large amount from the mean. 

For company two, two data values deviate by only $10,000 ($250,000
and $270,000) while two data values deviate by a whopping $230,000
($30,000 and $490,000).

In company one, 23 out of 24 data values deviate by less than $20,000. But for 
company two, only 2 out of 24 deviate by less than $20,000. This suggests that 
company one will have a smaller standard deviation than company two 
because there is less average deviation. This is supported by MegaStat, which 
shows that the population standard deviation for company one is $91,920.10 
versus company two, which has a population standard deviation of 
$138,443.73. 

Notice that even though company one has an outlier (the CEO’s salary), the 
standard deviation is less than company two. That is, the average variation 
from the mean is less for company one. Thus, the presence of an outlier does 
not necessarily result in a larger standard deviation. 

The story is different when we look at the coefficient of variation. For
company one, it is 190.98%, while for company two, it is 54.39%. This means
that company one has larger relative variation than company two. This is
because company two has a higher mean than company one and thus the
variation, relative to the mean, isn’t as large as it is in company one.

In this situation, the best measure of variation to use would be the coefficient
of variation as we are comparing two data sets with two different means.
Based on this, company one has larger relative variation than company two.

Notice that variance is not discussed here. As stated above, the variance is the
square of the standard deviation. Therefore, the units for variance in this
example would be $² (dollars squared), which makes no sense. Again, variance
is not a useful descriptive statistic.
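
The summary values for the two companies can be recomputed from the salary lists described in the example. This sketch uses Python's `statistics` module as a stand-in for the MegaStat output. One assumption is worth flagging: the stated coefficients of variation (190.98% and 54.39%) are reproduced when the sample standard deviation (dividing by n − 1) is placed over the mean, so that is what the code does, while the standard deviation row uses the population formula as labelled.

```python
import statistics

company_one = [30_000] * 23 + [490_000]
company_two = [30_000 + 20_000 * i for i in range(24)]  # 30,000 up to 490,000

for name, salaries in (("Company One", company_one), ("Company Two", company_two)):
    mean = statistics.mean(salaries)
    spread = max(salaries) - min(salaries)            # range
    pop_sd = statistics.pstdev(salaries)              # divides by N
    cv = statistics.stdev(salaries) / mean * 100      # sample SD (n - 1) over the mean
    print(name, len(salaries), round(mean, 2), spread, round(pop_sd, 2), round(cv, 2))
```

Both companies share the same range, yet the standard deviation and the coefficient of variation rank them in opposite orders, which is the point of the example.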


Note: 


Common Mistake 

Variation and variance might seem like the same word but they aren’t. 
Variation is a general term used to discuss how much the data values vary 
from each other, how much spread there is in the data, how consistent the data 
is, how volatile or risky the data is, and how much deviation there is in the data 
values. It is an umbrella term. Variance is a specific type of variation. It 
specifically refers to the square of the standard deviation. Therefore, it is 
incorrect to say, “There is a lot of variance in the data” or “The best measure 
of variance is ...”. 


Optional section: Comparing Values from Different Data Sets 


The standard deviation is useful when comparing data values that come from 
different data sets. If the data sets have different means and standard deviations, 
then comparing the data values directly can be misleading. 


• For each data value, calculate how many standard deviations away from
its mean the value is.

• Use the formula: value = mean + (#ofSTDEVs)(standard deviation); solve
for #ofSTDEVs.

$$\#\text{ofSTDEVs} = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$$

• Compare the results of this calculation.

#ofSTDEVs is often called a "z-score"; we can use the symbol z. In symbols, 
the formulas become: 


Sample: \( x = \bar{x} + zs \), so \( z = \frac{x - \bar{x}}{s} \) 


Population: \( x = \mu + z\sigma \), so \( z = \frac{x - \mu}{\sigma} \) 


Example: 
Exercise: 


Problem: 
Two students, John and Ali, from different high schools, wanted to find 
out who had the highest GPA when compared to his school. Which 
student had the highest GPA when compared to his school? 


Student   GPA    School Mean GPA   School Standard Deviation 
John      2.85   3.0               0.7 
Ali       77     80                10 


Solution: 


For each student, determine how many standard deviations (#ofSTDEVs) 
his GPA is away from the average, for his school. Pay careful attention to 
signs when comparing and interpreting the answer. 

\( z = \#\text{ofSTDEVs} = \frac{\text{value} - \text{mean}}{\text{standard deviation}} = \frac{x - \mu}{\sigma} \) 

For John, \( z = \#\text{ofSTDEVs} = \frac{2.85 - 3.0}{0.7} = -0.21 \) 

For Ali, \( z = \#\text{ofSTDEVs} = \frac{77 - 80}{10} = -0.3 \) 
John has the better GPA when compared to his school because his GPA is 
0.21 standard deviations below his school's mean while Ali's GPA is 0.3 
standard deviations below his school's mean. 


John's z-score of —0.21 is higher than Ali's z-score of —0.3. For GPA, 
higher values are better, so we conclude that John has the better GPA 
when compared to his school. 
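
The comparison above can also be checked with a few lines of Python. This is only a sketch: the function name `z_score` is chosen here for illustration, and Ali's GPA is taken to be 77, the value implied by his school's mean of 80, standard deviation of 10, and z-score of -0.3.

```python
def z_score(value, mean, std_dev):
    """Return how many standard deviations `value` lies from `mean`."""
    return (value - mean) / std_dev


# (GPA, school mean GPA, school standard deviation) for each student
john = z_score(2.85, 3.0, 0.7)
ali = z_score(77, 80, 10)

print(f"John: z = {john:.2f}")
print(f"Ali:  z = {ali:.2f}")

# For GPA, a higher z-score means a better standing relative to the school.
better = "John" if john > ali else "Ali"
print(f"{better} has the better GPA when compared to his school")
```

Because both z-scores are negative, the sign matters: the z-score closer to zero is the higher one.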


Note: 
Try It 
Exercise: 


Problem: 
Two swimmers, Angie and Beth, from different teams, wanted to find out 
who had the fastest time for the 50 meter freestyle when compared to her 
team. Which swimmer had the fastest time when compared to her team? 


Swimmer   Time (seconds)   Team Mean Time   Team Standard Deviation 
Angie     26.2             27.2             0.8 
Beth      27.3             30.1             1.4 


Solution: 
For Angie: \( z = \frac{26.2 - 27.2}{0.8} = -1.25 \) 


For Beth: \( z = \frac{27.3 - 30.1}{1.4} = -2 \) 


Distributions 


Now that we have learned about determining shape (histogram), centre (mean, 
median or mode), and variation (standard deviation, coefficient of variation and 
range), we can now describe the distribution of a data set. 


In [link], we examined the salaries for two different companies. 


Though we have not done the histogram for either of these data sets, we can 
imagine what they will look like to determine the shape. Company A will have 
one peak at $30,000 with an outlier at $490,000. This will make it skewed to 
the right. For Company B each data value has the same frequency, which makes 
the data uniform. 


For company A, we would describe the distribution of salaries as skewed to 
the right (shape), centred at $49,166.67 (mean), with variation of $91,820.10 
(standard deviation). 


For company B, we would describe the distribution of salaries as 
uniform (shape), centred at $260,000 (mean), with variation of $138,443.73 
(standard deviation). 
Chapter Review 


The mean and the median can be calculated to help you find the "center" of a 
data set. The mean is the best estimate for the actual data set, but the median is 
the best measurement when a data set contains several outliers or extreme 
values. The mode will tell you the most frequently occurring datum (or data) in 
your data set. The mean, median, and mode are extremely helpful when you 
need to analyze your data, but if your data set consists of ranges which lack 
specific values, the mean may seem impossible to calculate. However, the mean 
can be approximated: add the lower boundary to the upper boundary 
and divide by two to find the midpoint of each interval. Multiply each midpoint 
by the number of values found in the corresponding range. Divide the sum of 
these values by the total number of data values in the set. 
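
As an illustration of the midpoint method just described, here is a minimal Python sketch. The three intervals and their frequencies are hypothetical values made up for this example; they do not come from any data set in this chapter.

```python
# Hypothetical grouped data: (lower boundary, upper boundary, frequency)
intervals = [(0, 10, 4), (10, 20, 9), (20, 30, 7)]

# Total number of data values in the set
total = sum(freq for _, _, freq in intervals)

# Midpoint of each interval multiplied by the number of values in it
weighted_sum = sum((low + high) / 2 * freq for low, high, freq in intervals)

approx_mean = weighted_sum / total
print(f"Approximate mean: {approx_mean}")
```

Here the midpoints are 5, 15 and 25, so the approximation is (5·4 + 15·9 + 25·7) / 20.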


The standard deviation can help you calculate the spread of data. There are 
different equations to use if you are calculating the standard deviation of a sample 
or of a population. 


• The standard deviation allows us to compare individual data or classes to 
the data set mean numerically. 

• \( s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \) or \( s = \sqrt{\frac{\sum f(x - \bar{x})^2}{n - 1}} \) is the formula for calculating the 
standard deviation of a sample. To calculate the standard deviation of a 
population, we would use the population mean, \( \mu \), and the formula 
\( \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} \) or \( \sigma = \sqrt{\frac{\sum f(x - \mu)^2}{N}} \). 


Use the following information to answer the next three exercises: The following 
data show the lengths of boats moored in a marina. The data are ordered from 
smallest to largest: 
16; 17; 19; 20; 20; 21; 23; 24; 25; 25; 25; 26; 26; 27; 27; 27; 28; 29; 30; 32; 
33; 33; 34; 35; 37; 39; 40 
Exercise: 


Problem: Calculate the mean. 


Solution: 

Mean: 16 + 17 + 19 + 20 + 20 + 21 + 23 + 24 + 25 + 25 + 25 + 26 + 26 + 
27 + 27 + 27 + 28 + 29 + 30 + 32 + 33 + 33 + 34 + 35 + 37 + 39 + 40 = 738; 

\( \frac{738}{27} = 27.33 \) 


Exercise: 


Problem: Identify the median. 


Solution: 


Median = 27 
Exercise: 
Problem: Identify the mode. 


Solution: 


The most frequent lengths are 25 and 27, which occur three times. Mode = 
25, 27 
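
The three answers for the boat lengths can be verified with Python's standard `statistics` module. This is a sketch only; note that `statistics.multimode` (which returns every value tied for most frequent) assumes Python 3.8 or later.

```python
import statistics

# Boat lengths from the marina, ordered from smallest to largest
lengths = [16, 17, 19, 20, 20, 21, 23, 24, 25, 25, 25, 26, 26, 27,
           27, 27, 28, 29, 30, 32, 33, 33, 34, 35, 37, 39, 40]

mean = statistics.mean(lengths)         # sum of values / number of values
median = statistics.median(lengths)     # middle value of the ordered data
modes = statistics.multimode(lengths)   # all most-frequent values

print(f"n = {len(lengths)}, sum = {sum(lengths)}")
print(f"mean = {mean:.2f}, median = {median}, modes = {modes}")
```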


Use the following information to answer the next three exercises: Sixty-five 
randomly selected car salespersons were asked the number of cars they 
generally sell in one week. Fourteen people answered that they generally sell 
three cars; nineteen generally sell four cars; twelve generally sell five cars; nine 
generally sell six cars; eleven generally sell seven cars. Calculate the following: 
Exercise: 


Problem: sample mean = x̄ = 

Solution: 

Mean = (14*3+19*4+12*5+9*6+11*7)/65 = 4.75 
Exercise: 

Problem: median = 

Solution: 

4 
Exercise: 

Problem: mode = 

Solution: 


Mode = 4 (occurs 19 times) 


Exercise: 
Problem: 
The following data are the distances between 20 retail stores and a large 
distribution center. The distances are in miles. 
29; 37; 38; 40; 58; 67; 68; 69; 76; 86; 87; 95; 96; 96; 99; 106; 112; 127; 
145; 150 


Use a computer to find the standard deviation and round to the nearest 
tenth. 


Solution: 


s = 34.5 
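
One way to "use a computer" for this exercise is Python's built-in `statistics` module; any spreadsheet or statistics package would do equally well. The sketch below treats the 20 stores as a sample, so it uses `stdev` (which divides by n - 1); `pstdev` (which divides by N) is shown only for contrast.

```python
import statistics

# Distances (in miles) between the 20 retail stores and the distribution center
distances = [29, 37, 38, 40, 58, 67, 68, 69, 76, 86,
             87, 95, 96, 96, 99, 106, 112, 127, 145, 150]

s = statistics.stdev(distances)       # sample standard deviation
sigma = statistics.pstdev(distances)  # population standard deviation

print(f"sample standard deviation:     {round(s, 1)}")
print(f"population standard deviation: {round(sigma, 1)}")
```

The sample version is always slightly larger than the population version for the same data, because it divides by the smaller number n - 1.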


Bringing It Together 


Exercise: 


Problem: 


Javier and Ercilia are supervisors at a shopping mall. Each was given the 
task of estimating the mean distance that shoppers live from the mall. 
They each randomly surveyed 100 shoppers. The samples yielded the 
following information. 


      Javier    Ercilia 
x̄     6.0 km    6.0 km 
s     4.0 km    7.0 km 


a. How can you determine which survey was correct? 


b. Explain what the difference in the results of the surveys implies about 
the data. 

c. If the two histograms depict the distribution of values for each 
supervisor, which one depicts Ercilia's sample? How do you know? 


[Figure: two histograms, labelled (a) and (b)] 


Solution: 


a. It is difficult to determine which survey is correct. Both surveys 
include the same number of shoppers and the shoppers were 
randomly selected. We could look at how the random selection was 
done to see if one of the sampling techniques would result in a more 
representative sample. But if they used the same sampling technique, 
there is no way to know which sample is right. The only way would 
be to take another, larger sample and see which of the two 
supervisor's samples most closely matches that sample. But really we 
expect there to be sampling variability so it is not really an 
appropriate question to ask which is "correct". 

b. Since the mean is the same for both samples, this suggests that it is 
fair to say that on average shoppers travel 6.0 km to the mall. But the 
standard deviations are different. This suggests that it is not yet clear 
how much variation there is from the 6.0 km mean. 

c. Ercilia's data has a larger standard deviation. Therefore, on average, 
the data needs to be more spread out from the mean than Javier's. 
This suggests (b) is the answer. 


Use the following information to answer the next three exercises: We are 
interested in the number of years students in a particular elementary statistics 
class have lived in California. The information in the following table is from 
the entire section. 


Number of years   Frequency      Number of years   Frequency 
7                 1              22                1 
14                3              23                1 
15                1              26                1 
18                1              40                2 
19                4              42                2 
20                3 

                                 Total = 20 
Exercise: 


Problem: What is the mode? 


a. 19 

b. 19.5 

c. 14 and 20 
d. 22.65 


Solution: 
Mode = 19 (occurs 4 times) 
Exercise: 
Problem: Is this a sample or the entire population? 
a. sample 


b. entire population 
c. neither 


Solution: 


b 
Exercise: 


Problem: 


A survey of enrollment at 35 community colleges across the United States 
yielded the following figures: 


6414; 1550; 2109; 9350; 21828; 4300; 5944; 5722; 2825; 2044; 5481; 
5200; 5853; 2750; 10012; 6357; 27000; 9414; 7681; 3200; 17500; 9200; 
7380; 18314; 6557; 13713; 17768; 7493; 2771; 2861; 1263; 7285; 28165; 
5080; 11622 


a. Organize the data into a chart with six intervals of equal width. Label 
the two columns "Enrollment" and "Frequency." 

b. Construct a histogram of the data. 

c. What is the shape of the data? What does the shape tell you about the 
enrollment at these community colleges? 

d. What is the best measure of centre for this data and why? State the 
measure. 

e. What is the best measure of variation for this data and why? State the 
measure. 

f. If you were to build a new community college, what is the typical 
range for the enrollment? Why would this information be helpful? 
What caveats would you want to think about when you look at this 
typical range? 


Solution: 


a. 

Enrollment      Frequency 
0-4999          10 
5000-9999       16 
10000-14999     3 
15000-19999     3 
20000-24999     1 
25000-29999     2 


b. [Figure: histogram for enrollment at community colleges; horizontal axis: Enrollment, vertical axis: Percent] 


c. The shape is skewed to the right, which means that there are a few 
community colleges that have greater enrollment compared to most 
of the other colleges in the sample. 

d. Since the mean (8,628.74) is being skewed (it is larger than the 
median of 6,414), the median is the best measure of centre. 

e. Since we are only looking at one data set, the standard deviation is a 
good measure of variation. It is 6,943.88. 

f. The typical range is 6,414 +/- 6,943.88 = -529.88 to 13,357.88. As 
there can't be negative students enrolled, the typical range is 0 
students to 13,357.88 students. Though there could be multiple caveats, one 
concern is the rather large variation in the data. This means that 
community colleges have very different enrollment numbers. Perhaps 
looking at community colleges that are similar to the one I would like 
to open would be more beneficial as that population would be more 
representative of my community college. 
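
The tallies and summary statistics in this solution can be reproduced from the 35 enrollment figures. The following Python sketch assumes intervals of width 5,000 starting at 0, as in part a, and computes the "typical range" as the median plus or minus one sample standard deviation, as in part f.

```python
import statistics
from collections import Counter

enrollment = [
    6414, 1550, 2109, 9350, 21828, 4300, 5944, 5722, 2825, 2044, 5481,
    5200, 5853, 2750, 10012, 6357, 27000, 9414, 7681, 3200, 17500, 9200,
    7380, 18314, 6557, 13713, 17768, 7493, 2771, 2861, 1263, 7285, 28165,
    5080, 11622,
]

# Part a: six equal-width intervals 0-4999, 5000-9999, ..., 25000-29999
width = 5000
counts = Counter(value // width for value in enrollment)
for k in range(6):
    print(f"{k * width}-{(k + 1) * width - 1}: {counts[k]}")

# Parts d and e: measures of centre and variation
mean = statistics.mean(enrollment)
median = statistics.median(enrollment)
s = statistics.stdev(enrollment)  # sample standard deviation

# Part f: median +/- one standard deviation, cut off at zero
# because enrollment cannot be negative
low = max(0, median - s)
high = median + s
print(f"mean = {mean:.2f}, median = {median}, s = {s:.2f}")
print(f"typical range: {low:.2f} to {high:.2f}")
```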


Exercise: 


Problem: 


You work for a soda pop company that is producing a new label for their 
Asian market. Three different labels your company is considering are the 
same, except the colours are different. The colour choices are blue, green 
and orange. 


To determine which label consumers prefer, focus groups were done. One 
such focus group asked 15 participants to rate the cans from 1 to 10. A 
score of 1 means they hated the label and 10 means they loved the label. 
The results follow. 


Participant Blue Label Green Label Orange Label 


1 


2 


1 10 6 
4 8 7 
2 9 Z 


5 1 8 6 
6 1 7 7 
7 1 3 7 
8 4 9 8 
9 1 10 9 
10 7 4 6 
11 4 7 6 
12 5 6 7 
13 6 9 8 
14 4 4 6 
15 6 8 7 


Which label would you recommend as the new label for the Asian market? 
Support your decision using the data. 


Solution: 


Label 1 is excluded as most people don’t like it. The mean for label 2 and 
label 3 is the same. Label 2 could be considered the better label because 
more people love it than label 3, but more people hate it. Label 3 could be 
considered a better label because the variation is less - nobody hates it, but 
nobody loves it. (Note: Even though you are comparing two data sets, it is 
ok to look only at the standard deviation instead of the coefficient of 
variation in this situation. Why?). 


Choosing label 2 has greater risk (love/hate relationship). Choosing label 3 
has less risk (most people like it). 


Exercise: 


Problem: 


Three publicly traded telecommunications companies reported their 
monthly profit for the last year. The results are presented below. 


                     Company A   Company B             Company C 
Mean                 $10,930     $13,000               $34,450 
Median               $9,390      $13,500               $34,450 
Mode                 None        $13,000 and $20,000   $33,880 
Standard deviation   $4,196      $9,360                $4,116 
Range                $15,050     $42,150               $16,400 

1. Donna is close to retirement and wants to invest in one of the three 
companies. She doesn’t want to see her investment drop significantly 
as she doesn’t want to see her retirement savings dwindle. Which 
company would you recommend she invest in and why? 

2. What information is missing from the list that you might want to have 
to help you answer the above question? 

3. What information above is not necessary for making this decision? 


Solution: 


Note that this question is about risk, i.e. variation. 


1. Any answer requires that you examine the amount of variation in the 
data set. The coefficient of variation is the best measure to use to 
compare the variation as the means are different. 


                           Company A   Company B   Company C 
Coefficient of variation   38.39%      72%         11.95% 


2. The information provided is only for one year. It would be helpful to 
know about their changes over more than one year. Quartiles aren’t 
provided. They could help examine the variation as well. 

3. The median and the mode are not relevant as this is a question about 
variation. The mean is only required as it is needed to find the 
coefficient of variation. 
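
The coefficients of variation in this solution follow directly from the reported means and standard deviations (standard deviation divided by the mean, expressed as a percent). A short Python sketch, with the function name `coefficient_of_variation` chosen here for illustration:

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100


# (mean monthly profit, standard deviation) for each company
companies = {"A": (10930, 4196), "B": (13000, 9360), "C": (34450, 4116)}

cv = {name: coefficient_of_variation(sd, mean)
      for name, (mean, sd) in companies.items()}

for name, value in cv.items():
    print(f"Company {name}: CV = {value:.2f}%")

# The smallest CV marks the company with the least relative variation.
lowest = min(cv, key=cv.get)
print(f"Least relative variation: Company {lowest}")
```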


Glossary 


Frequency Table 
a data representation in which grouped data is displayed along with the 
corresponding frequencies 


Mean 
a number that measures the central tendency of the data; a common name 
for mean is 'average.' The term 'mean' is a shortened form of 'arithmetic 
mean.' By definition, the mean for a sample (denoted by x̄) is 
\( \bar{x} = \frac{\text{Sum of all values in the sample}}{\text{Number of values in the sample}} \), 
and the mean for a population (denoted by μ) is 
\( \mu = \frac{\text{Sum of all values in the population}}{\text{Number of values in the population}} \). 


Median 


a number that separates ordered data into halves; half the values are the 
same number or smaller than the median and half the values are the same 
number or larger than the median. The median may or may not be part of 
the data. 


Midpoint 
the mean of an interval in a frequency table 


Mode 
the value that appears most frequently in a set of data 


Measures of Location and Box Plots -- MRU -- C Lemieux (2017) 


Introduction 


Measures of location help us to understand where data values are located 
relative to other data values. We've already seen a measure of location - the 
median. It tells us what data value is in the middle of the data set. The most 
common measure of position is a percentile. Percentiles divide ordered data 
into hundredths. To score in the 90th percentile of an exam does not mean, 
necessarily, that you received 90% on a test. It means that 90% of test scores 
are the same or less than your score and 10% of the test scores are the same 
or greater than your test score. The median is the 50th percentile. 


Quartiles are a special type of percentile. Quartiles divide ordered 
data into quarters. The first quartile, Q1, is the same as the 25th percentile, 
and the third quartile, Q3, is the same as the 75th percentile. The median, M, 
is called both the second quartile and the 50th percentile. 


A visual representation of measures of location is called a box plot. 


In this section, we will learn how to find quartiles and use those quartiles to 
find the interquartile range and outliers. Then we will visually represent this 
information on a box plot. Unlike histograms and bar graphs, box plots 
require the use of numerical summaries. Thus the box plot is a representation 
that combines both visual and numerical summaries of the data. 


Measures of location 


As described in the introduction, a common measure of location is the 
percentile. Percentiles are useful for comparing values. For this reason, 
universities and colleges use percentiles extensively. One instance in which 
colleges and universities use percentiles is when SAT results are used to 
determine a minimum testing score that will be used as an acceptance factor. 
For example, suppose Duke accepts SAT scores at or above the 75th 
percentile. That translates into a score of at least 1220. 


Percentiles are mostly used with very large populations. Therefore, if you 
were to say that 90% of the test scores are less (and not the same or less) than 
your score, it would be acceptable because removing one particular data 
value is not significant. 


The median is a number that measures the "center" of the data. You can think 
of the median as the "middle value," but it does not actually have to be one of 
the observed values. It is a number that separates ordered data into halves. 
Half the values are the same number or smaller than the median, and half the 
values are the same number or larger. For example, consider the following 
data. 

1; 11.5; 6; 7.2; 4; 8; 9; 10; 6.8; 8.3; 2; 2; 10; 1 

Ordered from smallest to largest: 

1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5 


Since there are 14 observations, the median is between the seventh value, 6.8, 
and the eighth value, 7.2. To find the median, add the two values together and 
divide by two. 

Equation: 


\[ \frac{6.8 + 7.2}{2} = 7 \] 


The median is seven. Half of the values are smaller than seven and half of the 
values are larger than seven. 


Quartiles are numbers that separate the data into quarters. Quartiles may or 
may not be part of the data. To find the quartiles, first find the median or 
second quartile. The first quartile, Q1, is the middle value of the lower half of 
the data, and the third quartile, Q3, is the middle value, or median, of the 
upper half of the data. To get the idea, consider the same data set: 

1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5 



The median or second quartile is seven. The lower half of the data is 1, 1, 
2, 2, 4, 6, 6.8. The middle value of the lower half is two. 

1; 1; 2; 2; 4; 6; 6.8 


The number two, which is part of the data, is the first quartile. One-fourth of 
the entire set of values are the same as or less than two and three-fourths of 
the values are more than two. 


The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of 
the upper half is nine. 


The third quartile, Q3, is nine. Three-fourths (75%) of the ordered data set 
are less than nine. One-fourth (25%) of the ordered data set are greater than 
nine. The third quartile is part of the data set in this example. 
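
The "median of each half" procedure described above can be written as a small Python function. This is a sketch of that procedure only; the name `quartiles` is chosen for illustration, and the rule of leaving the median out of both halves when the number of values is odd is the convention the worked examples in this section follow. Statistical software often uses other quartile conventions, so its answers may differ slightly.

```python
import statistics


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # With an odd number of values, the median itself belongs to neither half.
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return (statistics.median(lower),
            statistics.median(data),
            statistics.median(upper))


data = [1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5]
q1, q2, q3 = quartiles(data)
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}")
```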


Possible Quartile Positions 


[Figure: number lines showing possible positions of Q1, Q2, and Q3 relative to the observations. The legend distinguishes observations less than Q1, observations less than Q2 but greater than Q1, observations less than Q3 but greater than Q2, and observations greater than Q3.] 


As mentioned in the previous section, the interquartile range is a measure 
of variation. It is a number that indicates the spread of the middle half or the 
middle 50% of the data. It is the difference between the third quartile (Q3) 
and the first quartile (Q1). 


IQR = Q3 - Q1 


The IQR can help to determine potential outliers. A value is suspected to be 
a potential outlier if it is more than (1.5)(IQR) below the first quartile or 
more than (1.5)(IQR) above the third quartile. Potential outliers always 
require further investigation. 


Note: 

NOTE 

A potential outlier is a data point that is significantly different from the other 
data points. These special data points may be errors or some kind of 
abnormality or they may be a key to understanding the data. 


Example: 
Exercise: 


Problem: 
For the following 13 real estate prices, calculate the IQR and determine 
if any prices are potential outliers. Prices are in dollars. 
389,950; 230,500; 158,000; 479,000; 639,000; 114,950; 5,500,000; 
387,000; 659,000; 529,000; 575,000; 488,800; 1,095,000 
Solution: 
Order the data from smallest to largest. 
114,950; 158,000; 230,500; 387,000; 389,950; 479,000; 488,800; 
529,000; 575,000; 639,000; 659,000; 1,095,000; 5,500,000 
M = 488,800 

Q1 = (230,500 + 387,000) / 2 = 308,750 

Q3 = (639,000 + 659,000) / 2 = 649,000 


IQR = 649,000 - 308,750 = 340,250 


(1.5)(IQR) = (1.5)(340,250) = 510,375 


1.5(IQR) less than the first quartile: Q1 - (1.5)(IQR) = 308,750 - 
510,375 = -201,625 


1.5(IQR) more than the third quartile: Q3 + (1.5)(IQR) = 649,000 + 
510,375 = 1,159,375 


No house price is less than —201,625. However, 5,500,000 is more than 
1,159,375. Therefore, 5,500,000 is a potential outlier. 
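
The same calculation can be sketched in Python. The quartiles here are found as the medians of the lower and upper halves of the ordered data (leaving the median out of both halves, since there are 13 values), which is the method used in this example; other software may use a different quartile convention.

```python
import statistics

prices = [389950, 230500, 158000, 479000, 639000, 114950, 5500000,
          387000, 659000, 529000, 575000, 488800, 1095000]

data = sorted(prices)
half = len(data) // 2
lower = data[:half]                                  # values below the median
upper = data[half + 1:] if len(data) % 2 else data[half:]

q1 = statistics.median(lower)
q3 = statistics.median(upper)
iqr = q3 - q1

low_limit = q1 - 1.5 * iqr    # 1.5(IQR) below the first quartile
high_limit = q3 + 1.5 * iqr   # 1.5(IQR) above the third quartile

potential_outliers = [p for p in data if p < low_limit or p > high_limit]
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"limits: {low_limit} to {high_limit}")
print(f"potential outliers: {potential_outliers}")
```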


Example: 
Exercise: 


Problem: 

For the two data sets in the test scores example, find the following: 
a. The interquartile range. Compare the two interquartile ranges. 
b. Any outliers in either set. 

Solution: 


The five number summary for the day and night classes is 


        Minimum   Q1   Median   Q3     Maximum 
Day     32        56   74.5     82.5   99 
Night   25.5      78   81       89     98 


a. The IQR for the day group is Q3 - Q1 = 82.5 - 56 = 26.5 


The IQR for the night group is Q3 - Q1 = 89 - 78 = 11 


The interquartile range (the spread or variability) for the day class 
is larger than the night class IQR. This suggests more variation 
will be found in the day class's test scores. 

b. Day class outliers are found using the IQR times 1.5 rule. So, 


• Q1 - IQR(1.5) = 56 - 26.5(1.5) = 16.25 
• Q3 + IQR(1.5) = 82.5 + 26.5(1.5) = 122.25 


Since the minimum and maximum values for the day class are 
greater than 16.25 and less than 122.25, there are no outliers. 


Night class outliers are calculated as: 


• Q1 - IQR(1.5) = 78 - 11(1.5) = 61.5 
• Q3 + IQR(1.5) = 89 + 11(1.5) = 105.5 


For this class, any test score less than 61.5 is an outlier. Therefore, 
the scores of 45 and 25.5 are outliers. Since no test score is greater 
than 105.5, there is no upper end outlier. 


Interpreting Percentiles, Quartiles, and Median 


A percentile indicates the relative standing of a data value when data are 
sorted into numerical order from smallest to largest. In general, p percent of 
the data values are less than or equal to the pth percentile. For example, 15% of data 
values are less than or equal to the 15th percentile. 


• Low percentiles always correspond to lower data values. 
• High percentiles always correspond to higher data values. 


A percentile may or may not correspond to a value judgment about whether it 
is "good" or "bad." The interpretation of whether a certain percentile is 
"good" or "bad" depends on the context of the situation to which the data 
applies. In some situations, a low percentile would be considered "good"; in 
other contexts a high percentile might be considered "good". In many 
situations, there is no value judgment that applies. 


Understanding how to interpret percentiles properly is important not only 
when describing data, but also when calculating probabilities in later chapters 
of this text. 


Note: 

Guideline 

When writing the interpretation of a percentile in the context of the given 
data, the sentence should contain the following information. 


• information about the context of the situation being considered 

• the data value (value of the variable) that represents the percentile 

• the percent of individuals or items with data values below the percentile 

• the percent of individuals or items with data values above the 
percentile. 


Example: 
Exercise: 


Problem: 


On a timed math test, the first quartile for time it took to finish the 
exam was 35 minutes. Interpret the first quartile in the context of this 
situation. 


Solution: 


• Twenty-five percent of students finished the exam in 35 minutes or 
less. 

• Seventy-five percent of students finished the exam in 35 minutes 
or more. 

• A low percentile could be considered good, as finishing more 
quickly on a timed exam is desirable. (If you take too long, you 
might not be able to finish.) 


Example: 
Exercise: 


Problem: 


On a 20 question math test, the 70th percentile for number of correct 
answers was 16. Interpret the 70th percentile in the context of this 
situation. 


Solution: 


• Seventy percent of students answered 16 or fewer questions 
correctly. 

• Thirty percent of students answered 16 or more questions 
correctly. 

• A higher percentile could be considered good, as answering more 
questions correctly is desirable. 


Note: 
Try It 
Exercise: 


Problem: 


On a 60 point written assignment, the 80th percentile for the number of 
points earned was 49. Interpret the 80th percentile in the context of this 
situation. 


Solution: 


Eighty percent of students earned 49 points or fewer. Twenty percent of 
students earned 49 or more points. A higher percentile is good because 
getting more points on an assignment is desirable. 


Example: 
Exercise: 


Problem: 


At a community college, it was found that the 30th percentile of credit 
units that students are enrolled for is seven units. Interpret the 30th 
percentile in the context of this situation. 


Solution: 


• Thirty percent of students are enrolled in seven or fewer credit 
units. 

• Seventy percent of students are enrolled in seven or more credit 
units. 

• In this example, there is no "good" or "bad" value judgment 
associated with a higher or lower percentile. Students attend 
community college for varied reasons and needs, and their course 
load varies according to their needs. 


Outliers 


The idea of potential outliers was discussed above. This section will look 
more in depth at how to find outliers and how to categorize them. 


Quartiles can also be used to determine if there are any outliers in a data set. 
To determine if there are outliers, we need to first calculate the inner and 
outer fences. The fences define the boundary between a “normal” data value 
and an “abnormal” data value (or outlier). Any data values that fall between 
the inner fences are normal data values. Any data values that fall outside 
the inner fences are considered outliers. 


The fences are calculated as follows: 

The inner fences are Q1 - IQR(1.5) and Q3 + IQR(1.5). 

The outer fences are Q1 - IQR(3) and Q3 + IQR(3). 

A mild outlier is any data value between the inner and outer fences. 


An extreme outlier is any data value beyond the outer fences. 


Example: 

Finding outliers 

Sharpe Middle School is applying for a grant that will be used to add fitness 
equipment to the gym. The principal surveyed 15 anonymous students to 
determine how many minutes a day the students spend exercising. The 
results from the 15 anonymous students are shown. 

0 minutes; 40 minutes; 60 minutes; 30 minutes; 60 minutes; 10 minutes; 45 
minutes; 30 minutes; 300 minutes; 90 minutes; 30 minutes; 120 minutes; 60 
minutes; 0 minutes; 20 minutes 

The five-number summary is determined to be: Min = 0; Q1 = 20; Med = 40; 
Q3 = 60; Max = 300. 

Are there any students who are exercising significantly more or less than the 
other students? 

To answer this question, we have to determine if there are any outliers. 

To do this, determine the inner fences. 

The IQR is 60-20=40. 

The lower inner fence is Q1 - IQR(1.5) = 20 - 40(1.5) = -40 and the upper 
inner fence is Q3 + IQR(1.5) = 60 + 40(1.5) = 120. Thus, any student who 
exercises between -40 minutes and 120 minutes is exercising a “normal” 
amount of time (relative to the rest of the students). Since someone can’t 
exercise -40 minutes, this is really 0 minutes to 120 minutes. Therefore, 300 
minutes appears to be an outlier. But is it a mild outlier or an extreme 
outlier? 


To determine if it is mild or extreme, we need to calculate the outer fence. 
We only need the upper outer fence as there are no low outliers (no one 
exercised for less than -40 minutes). The upper outer fence is Q3 + IQR(3) = 
60 + 40(3) = 180. If the potential outlier is between 120 and 180 minutes, 
then it is a mild outlier (as it is between the upper inner and outer fences). If 
it is more than 180 minutes, then it is an extreme outlier. In this case, 300 
minutes is an extreme outlier. This means that this student is exercising far 
more than the rest of their classmates! 
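
The fence rules used in this example can be collected into one small Python function. This is a sketch: the function name `classify` and its labels are chosen here for illustration, and the quartiles Q1 = 20 and Q3 = 60 are taken from the five-number summary given above rather than recomputed.

```python
def classify(value, q1, q3):
    """Label a data value as normal, a mild outlier, or an extreme outlier."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr
    if value < outer_low or value > outer_high:
        return "extreme outlier"
    if value < inner_low or value > inner_high:
        return "mild outlier"
    return "normal"


minutes = [0, 40, 60, 30, 60, 10, 45, 30, 300, 90, 30, 120, 60, 0, 20]
q1, q3 = 20, 60  # from the five-number summary

for m in sorted(set(minutes)):
    print(f"{m:>3} minutes: {classify(m, q1, q3)}")
```

A value that lies exactly on a fence, such as 120 minutes here, is treated as normal, which matches the description of 0 to 120 minutes as the "normal" range.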


Box Plots 


Box plots (also called box-and-whisker plots or box-whisker plots) give a 
good graphical image of the concentration of the data. They also show how 
far the extreme values are from most of the data. 


To construct a box plot, use a horizontal or vertical number line and a 
rectangular box. The smallest and largest data values label the endpoints of 
the axis. The first quartile marks one end of the box and the third quartile 
marks the other end of the box. Approximately the middle 50 percent of the 
data fall inside the box. The "whiskers" extend from the ends of the box to 
the smallest and largest data values. The median or second quartile can be 
between the first and third quartiles, or it can be one, or the other, or both. 
The box plot gives a good, quick picture of the data. 


A box plot is constructed from the five-number summary (the minimum 
value, the first quartile, the median, the third quartile, and the maximum 
value) and, if there are outliers, the fences. We use these values to compare 
how close other data values are to them. 

Example of a box plot 


[Figure: box plot; horizontal axis: Data, from 0 to 30000] 


This is an example of a box plot. The box is in the middle and represents 
50% of the data. The circles on the right represent outliers and the 
dashed lines the fences. The outliers at approximately 22000 and 27000 
are mild outliers, while the outlier at approximately 28500 is an extreme 
outlier. 


To construct a box plot, use a horizontal or vertical number line and a 
rectangular box. The smallest and largest data values label the endpoints of 
the axis. The first quartile marks one end of the box and the third quartile 
marks the other end of the box. The median is represented by a line inside the 
box. The middle 50 percent of the data fall inside the box and the length of 
the box is the interquartile range. 


The "whiskers" extend from the ends of the box to the first data values inside 
the fences. If there are no outliers, this would be minimum and maximum 
values. The outliers are represented by asterisks or dots and fall either 
between the inner and outer fences (mild outlier) or outside the outer fences 
(extreme outlier). 


Consider, again, this dataset. 
1; 1; 2; 2; 4; 6; 6.8; 7.2; 8; 8.3; 9; 10; 10; 11.5 


From the work done above, we know the five number summary is 1, 2, 7, 9, 
11.5. The IQR is 9 - 2 = 7. IQR(1.5) is 7(1.5) = 10.5. The lower inner fence is 
Q1 - IQR(1.5) = 2 - 10.5 = -8.5 and the upper inner fence is 
Q3 + IQR(1.5) = 9 + 10.5 = 19.5. Since no data values are smaller than -8.5 or 
larger than 19.5, there are no outliers in the data set. 


[Figure: box plot of the data on a number line from 1 to 11.5] 


The two whiskers extend from the first quartile to the smallest value and 
from the third quartile to the largest value. The median is shown with a 
dashed line. 


Note: 

NOTE 

It is important to start a box plot with a scaled number line. Otherwise the 
box plot may not be useful. 


Example: 

The following data are the heights of 40 students (in inches) in a statistics 
class. 

59; 60; 61; 62; 62; 63; 63; 64; 64; 64; 65; 65; 65; 65; 65; 65; 65; 65; 65; 66; 
66; 67; 67; 68; 68; 69; 70; 70; 70; 70; 70; 71; 71; 72; 72; 73; 74; 74; 75; 77 

Take this data and input it into Excel. Use the "Text to Columns" function in 
the "Data" menu to separate the data into separate columns. Then copy the 
data, but when you paste it, use paste special to "Transpose" the data so it is 
all in one column. 

Now use whatever software you are using to find the five-number summary. 


• Minimum value = 59
• Q1: First quartile = 64.75
• Q2: Second quartile or median = 66
• Q3: Third quartile = 70
• Maximum value = 77


Are there outliers? The IQR is 70-64.75 = 5.25. 
IQR(1.5) = 7.875 (don't round until the end) 


The lower inner fence is Q1 - IQR(1.5) = 64.75-7.875 = 56.875. Since the 
minimum value is 59, there are no lower outliers. 

The upper inner fence is Q3 + IQR(1.5) = 70+7.875 = 77.875. Since the 
maximum value is 77, there are no upper outliers. 

You can also use your computer program to create a box plot for the data. 
[Figure: BoxPlot of heights of students in statistics class; horizontal axis:
Heights (in), from 40 to 80]


Note: The titles and labels for a box plot follow the same rules as they do for
a histogram or a bar graph.


What does the box plot tell us? 


• Each quarter has approximately 25% of the data.

• The spreads of the four quarters are 64.75 - 59 = 5.75 (first quarter), 66
- 64.75 = 1.25 (second quarter), 70 - 66 = 4 (third quarter), and 77 - 70
= 7 (fourth quarter). So, the second quarter has the smallest spread and
the fourth quarter has the largest spread.

• Range = maximum value - minimum value = 77 - 59 = 18, which
means that from the shortest to the tallest student there is a difference of
18 inches.

• Interquartile Range: IQR = third quartile - first quartile = 70 - 64.75 =
5.25, which means that the middle 50% (middle half) of the data has a
range of 5.25 inches. This also means the length of the box is 5.25.
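The spreads listed above follow directly from the five-number summary. As a minimal sketch (the variable names are illustrative), they can be computed like this:

```python
# Five-number summary of the 40 heights, as stated in the text (inches)
summary = [59, 64.75, 66, 70, 77]  # min, Q1, median, Q3, max

# Spread of each quarter = difference between consecutive summary values
spreads = [b - a for a, b in zip(summary, summary[1:])]

data_range = summary[-1] - summary[0]   # maximum - minimum
iqr = summary[3] - summary[1]           # Q3 - Q1, the length of the box

# Position (1-based) of the quarter with the smallest and largest spread
narrowest = spreads.index(min(spreads)) + 1
widest = spreads.index(max(spreads)) + 1

print(spreads)            # [5.75, 1.25, 4, 7]
print(data_range, iqr)    # 18 5.25
print(narrowest, widest)  # 2 4
```

Because each quarter holds about 25% of the data, a smaller spread means the values in that quarter are packed more tightly.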


Note: 
Try It 
Exercise: 


Problem: 


The following data are the number of pages in 40 books on a shelf. 
Construct a box plot using computer software, and state the 
interquartile range. 


136 140 178 190 205 215 217 218 232 234 240 255 270 275 290 301 
303 315 317 318 326 333 343 349 360 369 377 388 391 392 398 400 
402 405 408 422 429 450 475 512 


Solution: 


[Figure: box plot of the number of pages, on a number line from 120 to 540]


IQR = 158 
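Different software can report slightly different quartiles for the same data. The sketch below, which assumes Python's standard `statistics` module, computes the quartiles of the page counts two ways: as medians of the lower and upper halves (which reproduces IQR = 158), and with the interpolating "inclusive" method. The helper name `quartiles_by_halves` is illustrative only.

```python
import statistics

pages = [136, 140, 178, 190, 205, 215, 217, 218, 232, 234, 240, 255, 270,
         275, 290, 301, 303, 315, 317, 318, 326, 333, 343, 349, 360, 369,
         377, 388, 391, 392, 398, 400, 402, 405, 408, 422, 429, 450, 475, 512]

def quartiles_by_halves(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the sorted data.

    For an odd number of values the overall median is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    return statistics.median(ordered[:half]), statistics.median(ordered[-half:])

q1, q3 = quartiles_by_halves(pages)
print(q1, q3, q3 - q1)                  # 237.0 395.0 158.0

# Interpolating convention from the statistics module
q1_inc, _, q3_inc = statistics.quantiles(pages, n=4, method="inclusive")
print(q1_inc, q3_inc, q3_inc - q1_inc)  # 238.5 393.5 155.0
```

Both answers describe the same box; the small difference comes only from how each method places a quartile between two neighboring data values.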


For some sets of data, some of the largest value, smallest value, first quartile, 
median, and third quartile may be the same. For instance, you might have a 
data set in which the median and the third quartile are the same. In this case, 
the diagram would not have a dotted line inside the box displaying the 
median. The right side of the box would display both the third quartile and 
the median. For example, if the smallest value and the first quartile were both 
one, the median and the third quartile were both five, and the largest value 
was seven, the box plot would look like: 


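[Figure: box plot on a number line from 1 to 7; the box runs from 1 to 5 with
no separate median line, and a whisker extends from 5 to 7]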


In this case, at least 25% of the values are equal to one. Twenty-five percent
of the values are between one and five, inclusive. At least 25% of the values
are equal to five. The top 25% of the values fall between five and seven,
inclusive.
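To see how quartiles can coincide, here is a small sketch with a hypothetical data set (the eight values are made up for illustration and do not come from the text). Quartiles are taken as medians of the lower and upper halves, which is an assumption about the method.

```python
import statistics

# Hypothetical data chosen so that min = Q1 = 1 and median = Q3 = 5
values = [1, 1, 1, 5, 5, 5, 5, 7]

ordered = sorted(values)
half = len(ordered) // 2

minimum = ordered[0]
q1 = statistics.median(ordered[:half])    # median of the lower half
med = statistics.median(ordered)
q3 = statistics.median(ordered[-half:])   # median of the upper half
maximum = ordered[-1]

print(minimum, q1, med, q3, maximum)   # 1 1.0 5.0 5.0 7

# The left whisker has length zero and the median sits on the box's right edge
print(q1 - minimum, q3 - med)          # 0.0 0.0
```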


Example: 

Test scores for a college statistics class held during the day are: 

99 56 78 55.5 32 90 80 81 56 59 45 77 84.5 84 70 72 68 32 79 90

Test scores for a college statistics class held during the evening are: 

98 78 68 83 81 89 88 76 65 45 98 90 80 84.5 85 79 78 98 90 79 81 25.5 
Exercise: 


Problem: 


a. Find the smallest and largest values, the median, and the first and
third quartile for the day class.

b. Find the smallest and largest values, the median, and the first and
third quartile for the night class.

c. For each data set, what percentage of the data is between the
smallest value and the first quartile? the first quartile and the
median? the median and the third quartile? the third quartile and
the largest value? What percentage of the data is between the first
quartile and the largest value?

d. Create a box plot for each set of data. Use one number line for
both box plots.

e. Which box plot has the widest spread for the middle 50% of the
data (the data between the first and third quartiles)? What does this
mean for that set of data in comparison to the other set of data?


Solution: 
a. Min = 32; Q1 = 56; M = 74.5; Q3 = 82.5; Max = 99

b. Min = 25.5; Q1 = 78; M = 81; Q3 = 89; Max = 98

c. Day class: There are six data values ranging from 32 to 56: 30%. 
There are six data values ranging from 56 to 74.5: 30%. There are 
five data values ranging from 74.5 to 82.5: 25%. There are five 
data values ranging from 82.5 to 99: 25%. There are 16 data values 
between the first quartile, 56, and the largest value, 99: 75%. Night 
class: 


d. [Figure: box plots for the day class and the night class on one number line
from 20 to 100]


e. The first data set has the wider spread for the middle 50% of the
data. The IQR for the first data set is greater than the IQR for the
second set. This means that there is more variability in the middle
50% of the first data set.
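The comparison in part e can be sketched in Python. The night-class summary is computed from the scores listed in the example, taking quartiles as medians of the lower and upper halves (an assumption about the method); the day-class summary is entered directly from the solution. The helper name `five_number_summary` is illustrative.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), with quartiles as medians of halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return (ordered[0],
            statistics.median(ordered[:half]),
            statistics.median(ordered),
            statistics.median(ordered[-half:]),
            ordered[-1])

night = [98, 78, 68, 83, 81, 89, 88, 76, 65, 45, 98, 90, 80, 84.5, 85, 79,
         78, 98, 90, 79, 81, 25.5]
night_summary = five_number_summary(night)
print(night_summary)          # (25.5, 78, 81.0, 89, 98)

# Day-class summary as stated in the solution
day_summary = (32, 56, 74.5, 82.5, 99)

day_iqr = day_summary[3] - day_summary[1]
night_iqr = night_summary[3] - night_summary[1]
print(day_iqr, night_iqr)     # 26.5 11
```

The larger IQR for the day class is what makes its box wider on a shared number line.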


Note: 
Try It 
Exercise: 


Problem: 


The following data set shows the heights in inches for the boys in a 
class of 40 students. 


66; 66; 67; 67; 68; 68; 68; 68; 68; 69; 69; 69; 70; 71; 72; 72; 72; 73; 73;
74

The following data set shows the heights in inches for the girls in a 
class of 40 students. 


61; 61; 62; 62; 63; 63; 63; 65; 65; 65; 66; 66; 66; 67; 68; 68; 68; 69; 69;
69

Construct a box plot using computer software for each data set, and 
state which box plot has the wider spread for the middle 50% of the 
data. 


Solution: 
[Figure: box plots of the heights of boys and the heights of girls on one
number line from 60 to 76]


IQR for the boys = 4 
IQR for the girls = 5 


The box plot for the heights of the girls has the wider spread for the 
middle 50% of the data. 
Chapter Review


The values that divide a rank-ordered set of data into 100 equal parts are
called percentiles. Percentiles are used to compare and interpret data. For
example, an observation at the 50th percentile would be greater than 50
percent of the other observations in the set. Quartiles divide data into
quarters. The first quartile (Q1) is the 25th percentile, the second quartile
(Q2 or median) is the 50th percentile, and the third quartile (Q3) is the 75th
percentile. The interquartile range, or IQR, is the range of the middle 50
percent of the data values. The IQR is found by subtracting Q1 from Q3, and
can help determine outliers by using the following two expressions.


• Q3 + IQR(1.5)
• Q1 - IQR(1.5)


Box plots are a type of graph that can help visually organize data. To graph a 
box plot the following data points must be calculated: the minimum value, 
the first quartile, the median, the third quartile, and the maximum value. 
Once the box plot is graphed, you can display and compare distributions of 
data. 

Exercise: 


Problem: 


On an exam, would it be more desirable to earn a grade with a high or 
low percentile? Explain. 


Solution: 


It is better to earn a grade in a high percentile as that means that you 
have done better on the exam relative to your classmates. 


Exercise: 


Problem: 


Mina is waiting in line at the Department of Motor Vehicles (DMV).
Her wait time of 32 minutes is the 85th percentile of wait times. Is that
good or bad? Write a sentence interpreting the 85th percentile in the
context of this situation.


Solution: 


When waiting in line at the DMV, the 85th percentile would be a long
wait time compared to the other people waiting. 85% of people had
shorter wait times than Mina. In this context, Mina would prefer a wait
time corresponding to a lower percentile. 85% of people at the DMV
waited 32 minutes or less. 15% of people at the DMV waited 32 minutes
or longer.


Exercise: 


Problem: 


In a study collecting data about the repair costs of damage to
automobiles in a certain type of crash test, a certain model of car had
$1,700 in damage and was in the 90th percentile. Should the
manufacturer and the consumer be pleased or upset by this result?
Explain and write a sentence that interprets the 90th percentile in the
context of this problem.


Solution: 


The manufacturer and the consumer would be upset. This is a large 
repair cost for the damages, compared to the other cars in the sample. 
INTERPRETATION: 90% of the crash tested cars had damage repair 
costs of $1700 or less; only 10% had damage repair costs of $1700 or 
more. 


Exercise: 


Problem: 


Suppose that you are buying a house. You and your realtor have
determined that the most expensive house you can afford is the 34th
percentile. The 34th percentile of housing prices is $240,000 in the town
you want to move to. In this town, can you afford 34% of the houses or
66% of the houses?


Solution: 


You can afford 34% of houses. 66% of the houses are too expensive for 
your budget. INTERPRETATION: 34% of houses cost $240,000 or less. 
66% of houses cost $240,000 or more. 


Exercise: 
Problem: 
Sixty-five randomly selected car salespersons were asked the number of 
cars they generally sell in one week. Fourteen people answered that they 
generally sell three cars; nineteen generally sell four cars; twelve 
generally sell five cars; nine generally sell six cars; eleven generally sell 
seven cars. Construct a box plot for this data. 
Solution: 


[Figure: BoxPlot of number of cars sold per salesperson; horizontal axis:
number of cars sold per salesperson, from 2 to 8]


Exercise: 


Problem: 


Looking at your box plot in the exercise above, does it appear that the 
data are concentrated together, spread out evenly, or concentrated in 
some areas, but not in others? How can you tell? 


Solution: 


More than 25% of salespersons sell four cars in a typical week. You can 
see this concentration in the box plot because the first quartile is equal to 
the median. The top 25% and the bottom 25% are spread out evenly; the 
whiskers have the same length. 
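The five-number summary behind this answer can be rebuilt from the frequencies given in the problem. This is a sketch that assumes quartiles are taken as medians of the lower and upper halves; the variable names are illustrative.

```python
import statistics

# Frequency table from the problem: cars sold per week -> number of salespersons
frequencies = {3: 14, 4: 19, 5: 12, 6: 9, 7: 11}

# Expand the table into the 65 individual observations, in order
cars = [value for value, count in sorted(frequencies.items())
        for _ in range(count)]

half = len(cars) // 2
q1 = statistics.median(cars[:half])    # median of the lower 32 values
med = statistics.median(cars)          # the 33rd value
q3 = statistics.median(cars[-half:])   # median of the upper 32 values

print(len(cars), min(cars), q1, med, q3, max(cars))   # 65 3 4.0 4 6.0 7

# Whisker lengths: left (Q1 - min) and right (max - Q3)
print(q1 - min(cars), max(cars) - q3)                 # 1.0 1.0
```

The first quartile and the median are both 4, so the median line falls on the left edge of the box.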


Exercise: 


Problem: 


In a survey of 20-year-olds in China, Germany, and the United States,
people were asked the number of foreign countries they had visited in
their lifetime. The following box plots display the results.

[Figure: box plots for China, Germany, and United States]


a. In complete sentences, describe what the shape of each box plot 
implies about the distribution of the data collected. 

b. Have more Americans or more Germans surveyed been to over 
eight foreign countries? 

c. Compare the three box plots. What do they imply about the foreign 
travel of 20-year-old residents of the three countries when 
compared to each other? 


Solution: 


a. The shape of the China box plot suggests that every person surveyed
except one visited either 0 foreign countries or 5 foreign countries.
For example, if 30 people were interviewed in China, either 29 of them
have visited no foreign countries and one of them has visited 5
foreign countries, or 29 of them have visited 5 foreign countries
and one of them has visited no foreign countries. The box plot does
not show which is the case. In Germany, 50% of those
surveyed have visited 8 or fewer countries. Based on the position of
the median, this suggests that there are many people in the survey
who have visited eight countries. This suggests the distribution will
have a peak at 8 and will be non-symmetric. In the USA, 50% of
those surveyed have visited 2 or fewer countries. As there are no
whiskers, this suggests that 25% of the Americans surveyed have
visited no foreign countries, which suggests a skew to the right for
the distribution.

b. 25% of Germans surveyed have been to more than 8 foreign 
countries. It is unclear what the percentage is for Americans but it 
is less than 25%. Therefore, Germany. 

c. Germans in the survey have visited far more countries than
Americans and the Chinese in the survey. China has the least
foreign travel.


Exercise: 


Problem: Given the following box plot, answer the questions. 


a. Think of an example (in words) where the data might fit into the
above box plot. In 2-5 sentences, write down the example.

b. What does it mean to have the first and second quartiles so close 
together, while the second to third quartiles are far apart? 


Solution: 


a. Answers will vary. Possible answer: State University conducted a
survey to see how involved its students are in community service.
The box plot shows the number of community service hours logged
by participants over the past year.

b. Because the first and second quartiles are close, the data in this
quarter is very similar. There is not much variation in the values.
The data in the third quarter is much more variable, or spread out.
This is clear because the second quartile is so far away from the
third quartile.


Exercise: 


Problem: 


A survey was conducted of 130 purchasers of new BMW 3 series cars, 
130 purchasers of new BMW 5 series cars, and 130 purchasers of new 
BMW 7 series cars. In it, people were asked the age they were when 
they purchased their car. The following box plots display the results. 


[Figure: box plots for BMW 3 series, BMW 5 series, and BMW 7 series]


a. In complete sentences, describe what the shape of each box plot
implies about the distribution of the data collected for that car
series.

b. Which group is most likely to have an outlier? Explain how you
determined that.

c. Compare the three box plots. What do they imply about the age of
purchasing a BMW from the series when compared to each other?

d. Look at the BMW 5 series. Which quarter has the smallest spread
of data? What is the spread?

e. Look at the BMW 5 series. Which quarter has the largest spread of
data? What is the spread?

f. Look at the BMW 5 series. Estimate the interquartile range (IQR).

g. Look at the BMW 5 series. Are there more data in the interval 31 to
38 or in the interval 45 to 55? How do you know this?

h. Look at the BMW 5 series. Which interval has the fewest data in it?
How do you know this?

i. 31-35
ii. 38-41
iii. 41-64


Solution: 


a. Each box plot is spread out more in the greater values. Each plot is 
skewed to the right, so the ages of the top 50% of buyers are more 
variable than the ages of the lower 50%. 

b. The BMW 3 series is most likely to have an outlier. It has the 
longest whisker. 

c. Comparing the median ages, younger people tend to buy the BMW 
3 series, while older people tend to buy the BMW 7 series. 
However, this is not a rule, because there is so much variability in 
each data set. 

d. The second quarter has the smallest spread. There seems to be only 
a three-year difference between the first quartile and the median. 

e. The third quarter has the largest spread. There seems to be 
approximately a 14-year difference between the median and the 
third quartile. 

f. IQR ~ 17 years

g. There is not enough information to tell. Each interval lies within a 
quarter, so we cannot tell exactly where the data in that quarter is 
concentrated. 

h. The interval from 31 to 35 years has the fewest data values.
Twenty-five percent of the values fall in the interval 38 to 41, and
50% fall between 41 and 64. Since 25% of values fall between 31
and 38, we know that fewer than 25% fall between 31 and 35.




Exercise: 


Problem: 


The following data represent the number of passengers per flight on the
AirBus from Calgary to Edmonton for 24 flights.


8, 19, 22, 23, 29, 30, 34, 35, 37, 39, 41, 44, 44, 46, 46, 47, 48, 49, 50,
52, 54, 55, 61, 65


a. Generate the boxplot for this data.

b. Identify the outliers in the data. Are they low or high outliers? Are
they extreme or mild outliers?

c. Interpret the outliers in the context of the question.

d. What is the IQR? Interpret it in the context of the question.

e. Which quarter of the data is the most concentrated? The least
concentrated?

f. What is the five-number summary (minimum, first quartile,
median, third quartile, maximum)?


Solution: 
a. [Figure: BoxPlot of number of passengers on AirBus; horizontal axis:
number of passengers, from 0 to 70]

b. There is one mild low outlier of 8 passengers on a flight. 

c. The outlier means that on this flight there were significantly
fewer passengers (only 8) than there are on other similar flights.

d. The IQR is 16.25 (from 33 to 49.25). This means that 50% of the 
time, the number of passengers is between 33 and 49.25 on the 
Airbus. This gives us a sense of the amount of variation in the 
number of passengers. 

e. The distance between the median and the third quartile (from 44 to
49.25) is the least (5.25). This means that these 25% of data values
are closely packed together. The distance between the outlier
and the first quartile is the largest (25 passengers). This means that
these 25% of the data values are spread out from each other.

f. The five-number summary is: Minimum = 8; First quartile = 33;
Median = 44; Third quartile = 49.25; Maximum = 65.
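The "mild low outlier" in part b can be checked from the quartiles alone. The sketch below assumes inner fences at 1.5 times the IQR and outer fences at 3 times the IQR beyond the quartiles (the factor of 3 is the usual convention and is an assumption here); the function name `classify` is illustrative.

```python
def classify(value, q1, q3):
    """Label a value as 'not an outlier', 'mild outlier', or 'extreme outlier'.

    Assumes inner fences at 1.5 * IQR and outer fences at 3 * IQR
    beyond the quartiles.
    """
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr
    if value < outer_low or value > outer_high:
        return "extreme outlier"
    if value < inner_low or value > inner_high:
        return "mild outlier"
    return "not an outlier"

# Quartiles from the five-number summary in the solution
q1, q3 = 33, 49.25

print(classify(8, q1, q3))    # mild outlier
print(classify(19, q1, q3))   # not an outlier
print(classify(65, q1, q3))   # not an outlier
```

With an IQR of 16.25, the lower inner fence is 33 - 24.375 = 8.625, so the flight with 8 passengers falls just below it while staying well inside the lower outer fence of -15.75.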


Bringing It Together 


Exercise: 


Problem: 


Santa Clara County, CA, has approximately 27,873 Japanese-Americans.
Their ages are as follows:


Age Group    Percent of Community
0-17         18.9
18-24        8.0
25-34        22.8
35-44        15.0
45-54        13.1
55-64        11.9
65+          10.3


a. Construct a histogram of the Japanese-American community in 
Santa Clara County, CA. The bars will not be the same width for 
this example. Why not? What impact does this have on the 
reliability of the graph? 

b. What percentage of the community is under age 35? 

c. Which box plot most resembles the information above? 


[Figure: three candidate box plots, listed in order with the values marked on
each]

0 24 34 53 100
0 18 34 45 100
0 24 25 54 100


Solution: 


a. [Figure: "Histogram" of ages of Japanese-Americans in Santa Clara County;
horizontal axis: age groups 0-17, 18-24, 25-34, 35-44, 45-54, 55-64]

This is technically not a histogram as the bars aren't touching,
but without the original data this is the best that I could come
up with unless I drew it by hand!


b. 49.7% of the community is under the age of 35. 
c. Based on the information in the table, graph (a) most closely 
represents the data. 
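Parts b and c both rest on cumulative percentages. As a rough sketch (the function names are illustrative), the running totals from the table show which age group contains each quartile:

```python
# Age groups and percent of community, from the table in the problem
age_table = [("0-17", 18.9), ("18-24", 8.0), ("25-34", 22.8), ("35-44", 15.0),
             ("45-54", 13.1), ("55-64", 11.9), ("65+", 10.3)]

def cumulative_percents(table):
    """Return (group, running total of percent) pairs, rounded to one decimal."""
    running = 0.0
    result = []
    for group, percent in table:
        running += percent
        result.append((group, round(running, 1)))
    return result

def group_containing(table, target):
    """Return the first age group whose cumulative percent reaches the target."""
    for group, cumulative in cumulative_percents(table):
        if cumulative >= target:
            return group
    return None

cumulative = cumulative_percents(age_table)
print(cumulative[2])      # ('25-34', 49.7)

quartile_groups = [group_containing(age_table, p) for p in (25, 50, 75)]
print(quartile_groups)    # ['18-24', '35-44', '45-54']
```

Since 49.7% of the community is under 35, the median sits right at the boundary between the 25-34 and 35-44 groups, the first quartile falls in the 18-24 group, and the third quartile falls in the 45-54 group.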


Glossary 


Box plot 
a graph that gives a quick picture of the middle 50% of the data 


First Quartile 
the value that is the median of the lower half of the ordered data
set


Frequency Polygon 
looks like a line graph but uses intervals to display ranges of large 
amounts of data 


Interval 
also called a class interval; an interval represents a range of data and is 
used when displaying large data sets 


Paired Data Set 
two data sets that have a one to one relationship so that: 


• both data sets are the same size, and
• each data point in one data set is matched with exactly one point
from the other set.


Skewed 
used to describe data that is not symmetrical; when the right side of a
graph looks “chopped off” compared to the left side, we say it is “skewed
to the left.” When the left side of the graph looks “chopped off”
compared to the right side, we say the data is “skewed to the right.”
Alternatively: when the lower values of the data are more spread out, we
say the data are skewed to the left. When the greater values are more
spread out, the data are skewed to the right.


